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Abstract 

Principal component analysis (PCA) is a classical dimension reduction method which projects 
data onto the principal subspace spanned by the leading eigenvectors of the covariance matrix. 
However, it behaves poorly when the number of features p is comparable to, or even much larger 
than, the sample size n. In this paper, we propose a new iterative thresholding approach for 
estimating principal subspaccs in the setting where the leading eigenvectors are sparse. Un- 
der a spiked covariance model, we find that the new approach recovers the principal subspace 
and leading eigenvectors consistently, and even optimally, in a range of high-dimensional sparse 
settings. Simulated examples also demonstrate its competitive performance. 

Keywords: dimension reduction, high-dimensional statistics, principal component analysis, 
principal subspace, sparsity, spiked covariance model, thresholding 

1 Introduction 

In many contemporary datasets, if we organize the p-dimensional observations xi, . . . ,Xn, into the 
rows of an n X p data matrix X, the number of features p is often comparable to, or even much 
larger than, the sample size n. For example, in biomedical studies, we usually have measurements 
on the expression levels of tens of thousands of genes, but only for tens or hundreds of individuals. 
One of the crucial issues in the analysis of such "large p" datasets is dimension reduction of the 
feature space. 



As a classical method, principal component analysis (PCA) [2 a . Il0l | reduces dimensionality 
by projecting the data onto the principal subspace spanned by the m leading eigenvectors of the 
population covariance matrix S, which represent the principal modes of variation. In principle, 
one expects that for some m < p, most of the variance in the data is captured by these m modes. 
Thus, PCA reduces the dimensionality of the feature space while retaining most of the information 
in data. In addition, projection to a low-dimensional space enables visualization of the data. In 
practice, S is unknown. Classical PCA then estimates the leading population eigenvectors by those 
of the sample covariance matrix S. It performs well in the traditional data setting where p is small 
and n is large [3]. 
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In high-dimensional settings, a collection of data can be modeled by a low-rank signal plus noise 
structure, and PCA can be used to recover the low-rank signal. In particular, each observation 
vector Xi can be viewed as an independent instantiation of the following generative model: 

Xi = IJ + Aui + azi. (1.1) 

Here, /i is the mean vector, A is a p x fh deterministic matrix of factor loadings, Ui is an m-vector 
of random factors, o" > is the noise level, and Zi is a p-vector of white noise. For instance, 
in chemometrics, Xi can be a vector of the logarithm of the absorbance or reflectance spectra 
measured with noise, where the columns of A are characteristic spectral responses of different 



chemical components, and «i's the concentration levels of these components j33l |. Here, the number 
of observations are relatively few compared with the number of frequencies at which the spectra are 
measured. In econometrics, Xi can be the returns for a collection of assets, where the Uj's are the 
unobservable random factors [HI. The assumption of additive white noise is reasonable for asset 
returns with low frequencies (e.g., monthly returns of stocks). Here, people usually look at tens 
or hundreds of assets simultaneously, while the number of observations are typically also at the 
scale of tens or hundreds. In addition, model (jl.ip also represents a big class of signal processing 



problems [3j]. Without loss of generality, we assume ^ = from now on. 

In this paper, our primary interest lies in PCA of high-dimensional data generated as in (jl.ip . 
Let the covariance matrix of n, be and suppose that A has full column rank and Ui and Zi are 
independent. Then, the covariance matrix of Xj becomes 

rh 

J: = A<^A' + a^I = ^X]qjq'j+a^I. (1.2) 

Here, Af > ••• > > are the eigenvalues of A^A', with qj, j = l,...,m, the associated 
eigenvectors. Therefore, the j-th eigenvalue of E is X'j+a^ for j = 1, . . . , m, and otherwise. Since 
there are fh spikes (Af , . . . , A^) in the spectrum of E, (11. 2p has been called the spiked covariance 
model in the literature [l^]- For data with such a covariance structure, it makes sense to project the 
data onto the low-dimensional subspaces spanned by the first few qj^s. Here and after, m denotes 
the number of spikes in the model, and m is the target dimension of the principal subspace to be 
estimated. 

Classical PCA encounters both practical and theoretical difficulties in high dimensions. On the 
practical side, the eigenvectors found by classical PCA typically involve all the p features, which 
makes their interpretation challenging. On the theoretical side, the sample eigenvectors are no 
longer always consistent estimators of their population counterparts. Sometimes, they can even 
be nearly orthogonal to the target direction. When both n,p ^ oo with n/p — )• c G (0, oo), at 
different levels of rigor and generality, this phenomenon has been examined by a number of authors 
23, [m, H, 0, l2ll | under model (|1.2p . See jl6l | for similar results in the case where p — t- oo and 



n is fixed. 

In recent years, to facilitate interpretation, researchers have started to develop sparse PCA 
methodologies, where they seek a set of sparse vector s sp anning the low-dimensional subspace 
that explains most of the variance. See, for example, 15, [stI . [3, 32, [s^. These approaches 



typically start with a certain optimization formulation of PCA and then induce a sparse solution 
by introducing appropriate penalties or constraints. 

On the other hand, when E indeed has sparse leading eigenvectors (perhaps in some transform 
domain), it becomes possible to estimate them consistently under high-dimensional settings via 
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new estimation schemes. For example, under normality assumption, when S only has a single 
spike, i.e., when m = 1 in (II. 2p . Johnstone and Lu l4] proved consistency of PCA obtained on a 
subset of features with large sample variances when the leading eigenvalue is fixed and logp/n — t- 0. 
Under the same single spike model, if in addition the leading eigenvector has exactly k non-zero 
loadings, Amini and Wainwright [l[ studied conditions for recovering the non-zero locations using 
the methods in 141 and [3], and Shen, et al 28] established conditions for consistency of a sparse 



PCA method in [29(] when p — )• cx) and n is fixed. For the more general multiple component case, 
Paul and Johnstone [2J] proposed an augmented sparse PCA method for estimating each of the 
leading eigenvectors, and showed that their procedure attains near optimal rate of convergence 
under a range of high-dimensional sparse settings when the leading eigenvalues are comparable and 
well separated. Notably, these methods all focus on estimating individual eigenvectors. 

In this paper, we focus primarily on finding principal subspaces of S spanned by sparse leading 
eigenvectors, as opposed to finding each sparse vector individually. One of the reasons is that 
individual eigenvectors are not identifiable when some of the leading eigenvalues are identical or 
close to each other. In addition, if we view PCA as a dimension reduction technique, it is the 
low-dimensional subspace onto which we project data that is of the greatest interest. 

We propose a new iterative thresholding algorithm to estimate principal subspaces, which is 
motivated by the orthogonal iteration method in the matrix computation literature. In addition to 
the usual steps of orthogonal iteration, an additional thresholding step is added to seek a sparse 
basis for the subspace. When S follows the spiked covariance model, the algorithm is shown to yield 
a uniformly consistent subspace estimator over a wide range of high-dimensional sparse settings, and 
the rate of convergence is given under an appropriate loss function. Moreover, for any individual 
leading eigenvector whose eigenvalue is well separated from the rest of the spectrum, our algorithm 
also yields an eigenvector estimator which attains near optimal rate of convergence. In addition, 
the algorithm also has appealing model selection property and computational efficiency. 

The contribution of the current paper is twofold. First, we propose to estimate principal sub- 
spaces. This is natural for the purpose of dimension reduction and visualization, and avoids the 
identifiability issue for individual eigenvectors. Second, we construct a new algorithm to estimate 
the subspaces, which is theoretically justified under an informative model, efficient in computation, 
and easy to implement. 

The rest of the paper is organized as follows. In Section [21 we frame the principal subspace 
estimation problem and propose the iterative thresholding algorithm. The statistical properties 
and computational complexity of the algorithm are examined in Sections [3] and H] under normality 
assumption. Simulation results in Section [5] demonstrate its competitive performance. Section [6] 
presents the proof of the main theoretical results. 

Reproducible code: The Matlab package SPCALab implementing the proposed method and 
producing the tables and figures of the current paper is available at the author's website. 



2 Methodology 
2.1 Notation 

We use ||a;||2 to denote the Euclidean norm of a vector x. For an m x n matrix A, its submatrix with 
rows indexed by / and columns indexed by J is denoted by Ajj. If / or J includes all the indices, 
we replace it with a dot. For example, Aj. is the submatrix of A with rows in / and all columns. 
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The spectral norm of A is \\A\\ = max|j^||2=i ||^x||2, and the range, i.e., the column subspace, of A is 
ran(A). If m > n, and the columns of A form an orthonormal set in M™, we say A is orthonormal. 

We use C, Co,Ci, etc. to represent constants, though their values might differ at different 
occurrences. For real numbers a and 6, let a V 6 = max(a,6) and a Ab = min(a,6). We write 
ci-n = 0{bn), if there is a constant C, such that |a„| < C6.„ for all n, and a„ = o(6„) if an/bn — )• 
as n — >• oo. Throughout the paper, we use v as the generic index for features, i for observations, j 
for eigenvalues and eigenvectors, and k for iterations in the algorithm to be proposed. 



2.2 Framing the Problem: Principal Subspace Estimation 

When the covariance matrix S follows the spiked covariance model (jl.2p . its j-th largest eigenvalue 
= A| + fj^ for j = 1, . . . ,m, and equals o"^ for all j > rh. Let span{-} denote the linear 
subspace spanned by the vectors in the curly brackets. If for some m < rh, imC^) < ^m+i(S), the 
principal subspace 

Vm = span{gi, . . . ,qm} 

is well defined, regardless of the behavior of the other ^j(S)'s. Therefore, it is an identifiable object 
for the purpose of estimation. Note that Vfn is always identifiable, because im(^) > ^m+iC^)- The 
primary goal of this paper is to estimate the principal subspace Vm, when the target dimension m 
is given. 

We do not always aim at Vm directly for two reasons. First, the number of factors m is usually 
not known a priori and needs to be estimated. So, always aiming for Vm can be too ambitious. 
Second, under certain circumstances, one might only be interested in the first several principal 
subspaces. For example, to visualize the data, one typically only need V2 or V3. 

To measure the accuracy of an estimator for a subspace, consider a subspace S and an estimator 
S such that dim(5) = dim(5). Note that for each linear subspace, there is a unique orthogonal 
projection matrix with that subspace as its range. Let P and P be the projection matrices associated 
with S and S, respectively. The distance between S and S is given by the spectral norm of the 
difference between P and P: dist(5,5) = ||P — See, for example, Sec. 2.6.3 in Thus, we 
can define a loss function by the squared distances between S and S: 

L{S,S) = dist2(5,5) = \\P-Pf. (2.1) 

By definition, this loss function measures the maximum possible discrepancy between the projec- 
tions of any unit vector onto the two subspaces. The loss ranges in [0, 1], and equals zero if and only 
if 5 = 5. Geometrically, it equals the squared sine of the largest canonical angle between S and 



S [30l . Theorem 5.5]. Throughout the paper, we use the loss function (j2.ip for principal subspace 



estimation. 



2.3 Orthogonal Iteration 

Given a positive definite matrix A, a standard technique to compute its leading low-dimensional 
eigenspace is orthogonal iteration [9t]. In the special case where only the first eigenvector is sought, 
it is also known as the power method. 

Suppose that A is p x p, and we want to compute its leading eigenspace of dimension m. 
Starting with a p x m orthonormal matrix Q^^\ orthogonal iteration generates a sequence oi pxm 
orthonormal matrices Q^^\ k = 1,2,..., by alternating the following two steps till convergence: 
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Algorithm 1: ITSPCA (Iterative thresholding sparse PC A) 
Input: 

1. Sample covariance matrix S; 

2. Target subspace dimension m; 

3. Thresholding function r], and threshold levels 7„j, j = 1, . . . , m; 

4. Initial orthonormal matrix Q^^\ 

Output: Subspace estimator Vm = ran where denotes the matrix at 

convergence. 

1 repeat 

2 Multiplication: T^^^ = {t^^^) = SQ^^'^^] 

3 Thresholding: f C^) = {t^^), with t^^ = Tnj); 

4 QR factorization: Q'^'^^B''^ = f 

5 until convergence; 



(1) Multiplication: T^^) = AQ^^^-^); 

(2) QR factorization: QC^)/?^ = T^^). 

Denote the orthonormal matrix at convergence by Q^°°\ Then its columns are the leading eigen- 
vectors of A, and ran(Q(°°)) gives the eigenspace. In practice, one terminates the iteration once 
Ta,n{Q^^^) stabilizes. 

Orthogonal iteration is useful for computing low-dimensional eigenspaces of high-dimensional 
matrices. However, when applied directly to a sample covariance matrix S, it gives the classical PCA 
result, which could be problematic in high dimensions. Observe that all the coordinates are included 
in orthogonal iteration. When the number of coordinates is large, not only the interpretation is hard, 
but the variance accumulated across all the coordinates becomes so high that it makes consistent 
estimation impossible. 

If the eigenvectors spanning T^tji are sparse, one sensible way to reduce estimation error is to 
focus only on those coordinates at which the leading eigenvectors have large values, and to estimate 
other coordinates by zeros. Of course, one introduces bias this way, but hopefully it is much smaller 
compared to the amount of variance thus reduced. 

The above heuristics motivate us to develop in the next subsection an estimation scheme which 
incorporates this coordinate screening idea, while at the same time retains the simplicity of orthog- 
onal iteration. 

2.4 Iterative Thresholding Algorithm 

An effective way to incorporate coordinate screening into orthogonal iteration is to 'kill' small 
coordinates of the T^^^ matrix after each multiplication step, which leads to the iterative threshold- 
ing scheme summarized in Algorithm [TJ Although the later theoretical study is conducted under 
normality assumption, Algorithm [T] itself is not confined to normal data. 

In addition to the two basic steps in orthogonal iteration. Algorithm [1] adds a thresholding 
step in between them, where we threshold each element of T^''^ with a user-specified thresholding 
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Algorithm 2: DTSPCA (Diagonal thresholding sparse PCA) 



Input: 

1. Sample covariance matrix S; 

2. Target subspace dimension m; 

3. Diagonal thresholding parameter a„. 

Output: Orthonormal matrix Q^^\ 

1 Variance selection: select the set B of coordinates (which are likely to have "big" signals): 

2 Reduced PCA: compute m leading eigenvectors, Qi , . . . ,q^, of the submatrix Sbb', 

3 Zero-padding: construct Q^^^ = . . . , qin^] such that, 

^0) AO) n ■ 1 



function ij which satisfies for all t and all 7 > 0, 

|r/(t,7)-t| <7, and r/(i, 7)l(|t|<^) = 0. (2.2) 

Here, 1(^e) denotes the indicator function of an event E. We note that both the hard-thresholding 
function r///(t,7) = and the soft-thresholding function ?/s(*)7) = sgn(t)(|t| — 7)+ satisfy 

condition (j2.2p . So does any r/ that is sandwiched by them, such as the thresholding function 
resulting from a SCAD criterion In r/(t,7), the parameter 7 is called the threshold level. In 
Algorithm [H for each column of T^^\ a common threshold level "ynj needs to be specified for all its 
elements. These threshold levels remain unchanged across iterations. 

Remark 2.1. The ranges of QC^) and f (*^) are the same, because QR factorization only amounts 
to a basis change within the same subspace. However, as in orthogonal iteration, the QR step is 
essential for numerical stability, and should not be omitted. Moreover, although the algorithm is 
designed for subspace estimation, the column vectors of Q^°^') can be used as estimators of leading 
eigenvectors under sparsity assumption. 



Construction of Q^^^ Note that Algorithm [T] requires an initial orthonormal matrix Q^^\ It can 
be generated by the 'diagonal thresholding' sparse PCA algorithm proposed by Johnstone and Lu 



ij] . The multiple eigenvector version of their proposal is summarized in Algorithm [2j When cr^ is 
unknown, we could replace it by an estimator in the definition of B. For example, Johnstone 
and Lu [M] suggested 

1 " 

= median(— ''^^xf^) (2.3) 

^ 1=1 

for normal data. When available, subject knowledge could also be incorporated into the construc- 
tion of Q(°). 
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Bibliographical note In the special case of m = 1, Algorithm [T] is similar to the sPCA-rSVD 
algorithm by Shen and Huang [29(] and the SPC algorithm by Witten, et al [36[]. When m > 1, both 
the sPCA-rSVD and the SPC methods propose to iteratively find the first leading eigenvectors of 
residual covariance matrices, which becomes different from our approach. 



3 Statistical Properties 

This section is devoted to analyzing the statistical properties of Algorithm [1] under normality 
assumption. The main results include rate of convergence for principal subspace estimation and 
a correct exclusion property. We also investigate the convergence rate for individual eigenvector 
estimation under appropriate conditions. 

Assume that the observation vectors xi, . . . ,Xn are i.i.d. Np{0, S) distributed, with S following 
the spiked covariance model (11. 2p . Further assume that o"^ is known - though this assumption 



could be removed by estimating a using, say, a in (j2.3|) . Since one could always scale the data 
first, we assume cr^ = 1 from now on. Thus, (jl.ip reduces to the following orthogonal factor form 



,n. 



(3.1) 



Here, Vij are i.i.d. standard normal random factors, which are independent of the i.i.d. white noise 
vectors Zi ~ A'p(0, 1), and {gj, 1 < j < ffi} is a set of leading eigenvectors of S. 

Throughout the section, the target subspace dimension is assumed to be pre-specified by some 
m < rh, and both m and fh remain unchanged as p grows. Here and after, let pn = p\/n, because we 
are mostly concerned with settings where p is greater than n and scales at certain rate as n — t- oo. 
For concreteness, the initial orthonormal matrix Q^^^ is always obtained via Algorithm [2] with 



a 



log(Pr, 



n 



1/2 



(3.2) 



for constructing the set B in step 1. In addition, the threshold levels in Algorithm [T] are set at 



Inj = 7 



ijiSBB) 



iog{pr. 



n 



1/2 



, j = l,...,m. 



(3.3) 



Here, a and 7 are user specified constants, and ijiSsB) is the j-th largest eigenvalue Sbb, where 
the set B is obtained in step 1 of Algorithm [21 



3.1 A Special Case 

To facilitate understanding, we first state the rate of convergence result for principal subspace 
estimation in a special case. 

Consider the asymptotic setting where n — )■ 00 with p > n and logp/n — )• 0, while the spikes 
Af > • • • > > X'^+i ^ ■ ■ ■ '^fh > remain unchanged. Suppose that the qj's are sparse in the 
sense that, for some r G (0,2), the ir norm of the eigenvectors are uniformly bounded by s, i.e., 
WljWr = (Sj=i iQfjl^)^^^' ^ for j = 1, . . . ,m. Here, s > 1 is a constant not depending on p. 

Under the above setup, for h(x) = x^ jix + 1), we have the following upper bound on the rate 
of convergence for subspace estimation error. 
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Theorem 3.1. Under the above setup, for sufficiently large constants a and 7 in (j3.2p and (|3.3p . 
there exist constants Cq, Ci = Ci(7,r, m) and C2, such that for sufficiently large n, with probability 
at least 1 — Cop~'^, the subspace estimator Vm^ = Tan{Q^^^) in the k-th iteration of Algorithmic 
satisfies 



sup L{Vm,Vi^)<Cims'- 

lkjl|r<S,Vj 



logp 



-r/2 



+ C2 5fm(A) 



logp 



n 



for all k = K, ... ,2K , where K = 0{\ogn) and g„i{X) = 1x2 _\2 « • 

The upper bound in Theorem 13.11 consists of two terms. The first term is a "nonparamet- 
ric" term, which can be decomposed as the product of two components. The first component, 
TTis'' [n/i( )/ log p]''/^, up to a multiplicative constant, bounds the number of coordinates used in 
estimating the subspace, while the second component, logp/[n/i(A^)], gives the average error per 
coordinate. The second term in the upper bound, gm(A) logp/n, up to a logarithmic factor, has the 
same form as the cross-variance term in the "fixed p, large n" asymptotic limit for classical PCA 
estimator (cf., 0, Theorem 1]). We could call it a "parametric" error term, because it always arises 
when we try to separate the first m eigenvectors from the rest, regardless of how sparse they are. 
Under the current setup, both terms in the upper bound converge to as n — >■ 00, which establishes 
the consistency of our estimator. 

To better understand the upper bound, we compare it with some lower bound result. Suppose 
Af > A^- Consider the simplest case where m = 1. Then, estimating Vi is the same as estimat ing 



the first eigenvector qi. For estimating any individual eigenvector gj, Paul and Johnstone |24l | 



considered the loss function /(gj, qj) = \\qj — sgn(g^gj)gj|||. In the special case here, the A|'s, s and 
r G (0,2) are fixed and p > n, so when n is large, s'^[nh{)^) / log pY^"^ < Cp^^^ for some c G (0, 1). 



For this case. Theorem 2 in 2j] asserts that for any estimator gi, 



sup E/((7i,^i) > Cis'^' 

lkjl|r<S,Vj 



logp 
nh{\l) 



l-r/2 



+ O2 • 



n 



Let Vi = span{§i}. We have \l{qi,qi) < L{Vi,Vi) < l{qi,qi). So the above lower bound also 
holds for any Vi and KL{'Pi,Vi). Therefore, Theorem 13 . 1 1 shows that the estimator from Algorithm 
[1] is near optimal, since it attains the minimax lower bound within a logarithmic factor. 

The theorem also states that the estimator could be obtained in O(logn) iterations, and the 
upper bound holds for all thresholding function rj satisfying (j2.2p . Last but not least, since the 
choices of a„ and ^nj do not involve any unknown parameter, the theorem establishes the adaptivity 
of our estimator: the near optimal rate of convergence is obtained without any knowledge of the 
power r, the radius s or the spikes Xj. 

Remark 3.1. Although special, the case considered here is still reasonable for a number of settings. 
For example, the settings of numerical experiments in Section [5l 

Later in Section 13.41 Theorem 13.21 establishes analogous rate of convergence result, but for 
a much wider range of high-dimensional sparse settings. In particular, the above result will be 
extended simultaneously along two directions: 

1. The spikes Af , . . . , A^ will be allowed to scale as n — )• 00, and X^+i^ ■ ■ ■ could even be of 
smaller order as compared to the first m spikes; 
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2. Each individual eigenvector qj will be constrained to a weak-ij. ball of radius Sj (which contains 
the ir ball of the same radius), and the radii Sj's will be allowed to diverge as n — t- oo. 

In what follows, we first state precise assumptions for the aforementioned extension in Section 
13.21 followed by the introduction of a few key quantities in Section 13.31 Finally, the main results 
are stated in Section! 



3.2 Assumptions 

Here, we state assumptions for the general theoretical results in Section [37 

As outlined above, the first extension of the special case is to allow the spikes = A^(n) > to 
depend on n, though the dependence will usually not be shown explicitly. Recall that pn = p\/ n, 
we impose the following growth rate condition on p and the A^'s. 

Condition GR. As n ^ oo, we have 

1. the dimension p satisfies logp/n = o(l); 

2. the largest spike Af satisfies log(Af) = 0(log(p„)); the smallest spike A^ satisfies log{pn) = 
o(nA^); and their ratio satisfies Af/A^ = 0(ra[log(p„)/n]^/^"'"''/^). 

The first part of condition GR requires the dimension to grow at a sub-exponential rate of the 
sample size. The second part ensures that the spikes grow at most at a polynomial rate with p„, 
and are all of larger magnitude compared to y^log(p„,)/n. In addition, the condition on the ratio 
Af/A^ allows us to deal with the interesting cases where the first several spikes scale at a faster 



rate with n than the others. This is more flexible than the assumption previously made in [2j] that 
all the spikes grow at the same rate. 

Turn to the sparsity assumption on the g^-'s. We first make a mild extension from iy. ball to weak- 
er- ball 0]. To this end, for any p- vector u, order its coordinates by magnitude as > • • • > 
We say that u belongs to the weak-^^ ball of radius s, denoted hy u £ wir{s), if 

\u\(^^) < for aU V. 

For r G (0,2), the above condition implies rapid decay of the ordered coefficients of u, and thus 
describes its sparsity. For instance, consider u = {\/^/k, . . . , 0, . . . , 0)' with exactly k non- 

zero entries all equal to Xjyfk. Then, for fixed r G (0,2), we have u G wir{k^^^^^^'^). In particular, 
when k = 1, u £ w£r{l)- Note that weak-^r- ball extends ir ball, because ||u||r < s, i.e., u G ir{s), 
implies u G wiris)- 

In what follows, we assume that for some fixed r G (0,2) and all j < m, qj G wir{sj) for 
some Sj > 1. We choose to use the notion of "weak-^^ decay", because it provides a unified 
framework for several different notions of sparsity, which is convenient for analyzing a statistical 
estimation problem from a minimax point of view 0]. Hence, at any fixed n, we will consider 
whether Algorithm [1] performs uniformly well on n i.i.d. observations Xi generated by (j3.ip whose 
covariance matrix T, belongs to the following uniformity class 

Tn = \ ^pxp = ^ ^'jQjQj + 1 ■■ Qj & wir{sj)yj \. 

^ 7 = 1 ^ 
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For our general results, we allow the radii Sj's to depend on or even diverge with n, though we 
need to assume that they do not grow too rapidly, so the leading eigenvectors are indeed sparse. 
This leads to the following sparsity condition. 



Condition SP. As n ^ oo, the radius Sj of the weak-ir ball satisfies Sj > 1 and 

"log(p„)" 



l/2-r/4 

= o{l^\\), for j = l,...,fn. 



We note that this type of condition also appeared in previous investigation on estimating in- 



dividual eigenvectors in the multiple component spiked covariance model 2J]. The condition is, 



for example, satisfied if condition GR holds and the largest spike Af is bounded away from zero 
while the radii Sj's are all bounded above by an arbitrarily large constant. That is, if there exist a 
constant C > 0, such that Af > 1/C and Sj < C for all j < rh and all n. 

It is straightforward to verify that conditions GR and SP are satisfied by the special case in 
Section 13.11 We conclude this part with an example. 

Example. When each Xi collects noisy measurements of an underlying random function on a regu- 



larly spaced grid, model (|3.ip becomes discretization of a functional PC A model |26|], and the qj^s 
are discretized eigenfunctions. When the eigenfunctions are smooth or have isolated singularities 
either in themselves or in their derivatives, their wavelet coefficients belong to some weak ir ball 
0]. So do the discrete wavelet transform of the q/s. Moreover, the radii of the weak £r balls are 
determined by the underlying eigenfunctions and are thus uniformly bounded as the size of the grid 
p gets larger. In this case, condition SP is satisfied when condition GR holds and A^ is bounded 
away from zero. So, for functional data of this type, we could always first transform to the wavelet 
domain and then apply Algorithm [H 

3.3 Preliminaries 

We now introduce a few quantities which appear later in the general theoretical results. 

The first quantity gives the rate at which we distinguish high from low signal coordinates. Recall 
that h{x) = x^ l(x + 1). For j = 1, . . . ,m, define 



ilM^y 

According to Paul [i^], up to a logarithmic factor, r^^ can be interpreted as the average error per 
coordinate in estimating an eigenvector with eigenvalue A^ + 1. Thus, a coordinate can be regarded 
as of high signal if at least one of the leading eigenvectors is of larger magnitude on this coordinate 
compared to r„j. Otherwise, we call it a low signal coordinate. 

Given the cutoff rate r^j between high and low signals, we can define the set of high signal 
coordinates 

H = -ff(/3) = {v : \qvj\ > f^Tnj, for some I < j < m}. (3.5) 

Here, /3 is a constant that does not depend on n, the actual value of which will be specified later. 
In addition, we let L = {1, . . . ,p}\H be the complement of H. Here, H stands for "high", and L 
for "low" (also recall the set B in Algorithm [21 where B stands for "big"). The dependence of H, 
L and B on n is suppressed for notational convenience. 
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To understand the convergence rate of our subspace estimator, it is important to have an upper 
bound for caid{H), the cardinahty of H. As we shall show later, the upper bound is a constant 
multiple of 

j=l ^nj 

Here we use capital letter to indicate the order m < m < Mn- In the general result, M„ plays the 
same role as ms^[n/i(A^)/ logp]^'/^ plays in Theorem 13. 11 See the discussion following Theorem 13. 11 
The last quantity we introduce is related to the "parametric" term in the convergence rate. For 
J = 1, . . . , 771, define 

2 _ (AHl)(A,Vi + l)log(pO 

where A^^^^ = 0. So the second term of the upper bound in Theorem 13.11 is then C2e^^. For the 
interpretation of this quantity, see the discussion following Theorem 13.11 



3.4 Main Results 

We turn to the statement of main theoretical results. 

A key condition for our results is the asymptotic distinguishability (AD) condition introduced 
below. Recall that all the spikes A| (hence all the leading eigenvalues) are allowed to depend on n. 
The condition AD will guarantee that the largest few eigenvalues are asymptotically well separated 
from the rest of the spectrum, and so the corresponding principal subspace is distinguishable. 

Definition. We say that condition AD{j, k) is satisfied if there exists a numeric constant k, such 
that the gap between the j-th and the {j + l)-th eigenvalues satisfies 

X2 

A^ — ^"j+i > — , for all n. 

In Corollary 13.21 below, we will need a pair of conditions AD(j — 1,k) and AD(j, k) for some 
j < fh, hence we define AD(0, k) and AD(?7i, k) by letting Ag = oo, and A^_,_^ = 0. Therefore, 
AD(0,«;) always holds trivially. 

In the special case of Section [3Tl the spikes are all constants with A^ > X^+i- So condition 
AD(?Ti, k) is satisfied for any constant k > Af/(A^ — X'^_^_l). 



Rate of Convergence for Principal Subspace Estimation 

Recall that the thresholding parameters an and "fnj are specified in (j3.2p and (j3.3p . The follow- 
ing theorem establishes the rate of convergence of the principal subspace estimator obtained via 
Algorithm [1] under relaxed assumptions, which generalizes Theorem 13.11 

Theorem 3.2. Suppose conditions GR and SP hold, and condition AD{m, k) holds for the given 
subspace dimension m and some k. Let j3 = c/y/rn in the definition (j3.5p of H , and let a and 
7 > 7o(c) in (|3.2p and (j3.3p be sufficiently large. Then, there exist constants Cq, Ci = Ci{'j,r,m, k) 
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for all k = K, . . . , 2K, where 



K = 1.01k (^1 + ^ 



) 



( 



1 + 



log 2 



1 



) 



logn + Vlog/i(Ai) 



(3.9) 



Theorem 13.21 states that for appropriately chosen threshold levels and all thresholding function 
satisfying (|2.2p . after enough iterations, Algorithm [1] yields principal subspace estimators whose 
errors are, with high probability, uniformly bounded over J-'n by a sequence of asymptotically 
vanishing constants as n — )• oo. In addition, the probability that the estimation error is not well 
controlled vanishes polynomially fast. Therefore, the subspace estimators are uniformly consistent 
over J^n- 

The interpretation of the two terms in the error bound (j3.8p is similar to those in Theorem 13.11 
Having introduced those quantities in Section [3.31 we could elaborate a little more on the first, i.e., 
the "nonparametric" , term. By Theorem 13.31 below, when estimating Vm, Algorithm [1] focuses only 
on the coordinates in the set H, the cardinality of which is card{H) = 0{Mn). Since can be 
interpreted as the average error per coordinate, the total estimation error accumulated over all the 
coordinates in H is thus of order 0{MnT!^^). Moreover, as we will show later, the squared bias 
induced by focusing only on H is also of order 0{MnT^^). Thus, this term indeed comes from the 
bias- variance tradeoff of the nonparametric estimation procedure. The meaning of the second, i.e., 
the "parametric" , term is exactly the same as in Theorem 13.11 Finally, we note that both terms 
vanish as n — )• oo under conditions GR, SP and AD(m, k). 

The expression in (j3.9p implies that Algorithm [1] only needs a relatively small number of itera- 
tions to yield the desired estimator. In particular, when the largest spike Af is bounded away from 
zero, ()3.9p implies that K = O(logn) iterations suffice. 

The threshold levels a„ and 7„j in (13. 2p and (j3.3p does not depend on unknown parameters. 
Thus, similar to the special case, the estimation procedure is adaptive to a wide range of high- 
dimensional sparse settings. In the simulation studies, fixing a = 3 in (13. 2p and 7 = 1.5 in (13. 3p 
usually leads to satisfactory results. For details, see Section [3 

The result in Theorem 13.21 could also be extended to an upper bound for the risk. Note that 

= o(t^^ V e^^), and that the loss function (12. ID is always bounded above by 1. The following 
result is a direct consequence of Theorem 13.21 

Corollary 3.1. Under the setup of Theorem \3.^ we have 



Correct Exclusion Property 

We now switch to the model selection property of Algorithm [TJ By our discussion in Section [21 
an important motivation for the iterative thresholding procedure is to trade bias for variance by 
keeping low signal coordinates out of the orthogonal iterations. More specifically, it is desirable to 



supEL(n^,^i'^) < CiMnT^^ + Cse^ 
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restrict our effort to estimating those coordinates in H and simply estimating those coordinates in 
L with zeros. 

By construction, Algorithm [2] yields an initial orthonormal matrix with a lot of zeros, but 
Algorithm [T] is at liberty to introduce new non-zero coordinates. The following result shows that 
with high probability all the non-zero coordinates introduced are in the set H. 

Theorem 3.3. Under the setup of Theorem \3.2l with probability at least 1 — Cop~'^, for all k = 
0, . . . ,2K, the orthonormal matrix Q^^^ has zeros in all its rows indexed by L, i.e., = 0. 

We call the property in Theorem 13.31 "correct exclusion" , for it ensures that all the low signal 
coordinates in L are correctly excluded from iterations. In addition, Theorem 13.31 shows that our 
principal subspace estimator is indeed spanned by a set of sparse loading vectors, where all loadings 
in L are exactly zero. 



Rate of Convergence for Individual Eigenvector Estimation 

The primary focus of the current paper is on estimating principal subspaces. However, when an 
individual eigenvector, say qj, is identifiable, it may also be of interest to see whether Algorithm [1] 
could estimate it well. The following result shows that for large enough k, the j-th column of Q*-'^^ 
estimates qj well, provided that the j-th. eigenvalue is well separated from the rest of the spectrum. 

Corollary 3.2. Under the setup of Theorem \3.S\ if in addition for some j < m, both conditions 
AD{j — 1, k) and AD{j,K) hold, then with probability at least 1 — Cqp~'^, qj^\ the j-th column of 
Q^^\ satisfies 

supL(span{gj},span{g^''^}) < CiM„r^j- + C2{elj_i V elj). 

Moreover, supjr^ EL(span{(7j}, span{^'^^}), the supremum risk over Fn, is also bounded by the right 
side of the above inequality. 



Corollary 13.21 connects closely to the previous investigation by Paul and Johnstone [241 ] on 



estimating individual sparse leading eigenvectors. Recall their loss function l{qj,qj) = \\qj — 
sgn(qjqj)qj\\2, which is equivalent to the restriction of the loss function (j2.ip to one-dimensional 
subspaces, since ^l{qj,qj) < L(span{gj}, spanj^j}) < l{qj,qj). Then, Corollary 13.21 implies that 

supE l{qj,q) >) < CiMnT^j + C2 ^ _ ^ ( {2 _ ^2 ^^2 



n 



When the radii of the weak-^^ balls grow at the same rate, i.e., maxj Sj/ min^ Sj = 0(1), the 



upper bound in the last display matches the lower bound in Theorem 2 of 2J] up to a logarithmic 
factor. Thus, when the j-th. eigenvalue is well separated from the rest of the spectrum. Algorithm 
[T] yields a near optimal estimator of qj in the adaptive rate minimax sense, since the thresholds a„ 
and 7„j do not depend on the unknown parameters. 
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4 Computational Complexity 



In this section, we study the computational complexity of Algorithm [H Throughout, we assume 
the same setup as in Section [3j In addition, we restrict the calculation to the high probability event 
on which the conclusions of Theorems 13.21 and 13.31 hold. In what follows, for any matrix A, we use 
supp{^} to denote the index set of the rows of A which has at least one non-zero entry. 

Consider a single iteration, say, the k-th. In the multiplication step, the (i^, j)-th element ofT^^\ 
t^^j , comes from the inner product of the v-th. row of S and the j-th column of Q^^^^\ Though 
both are p- vectors. Theorem 13.31 asserts that for any column of Q^^^^\ at most card(i7) of its 
entries are non-zero. Thus, if we know suppjQ^'"'^^)}, then t^^^ could be calculated in 0(card(//)) 

flops, and T^'^^ in 0{mpcaid{H)) flops. Since supp{Q^'^~^)} can be obtained in 0{mp) flops, the 
multiplication step can be completed in 0(mp card(-ff)) flops. Next, the thresholding step performs 
elementwise operation on T^^\ and hence can be completed in 0{mp) flops. Finally, consider the 
QR step. First, we can obtain supplT^*^^} in 0{mp) flops. Then, QR factorization could be 
performed on the reduced matrix which only includes the rows in suppjT^'^)}. Since Theorem 13.31 
implies suppjr'''^-*} = suppjQ^'^^} C H, the complexity of this step is 0(m^card(-?/)). 

Since m = 0{p), the complexity of the multiplication step dominates the other two steps, and 
so the complexity of each iteration is 0(mp card (i/)). Theorem 13.21 shows that K iteration is 
enough to achieve the error bound in ()3.8p . Therefore, the overall complexity of Algorithm [1] is 
0{Kmp card{H)) . 

When the true eigenvectors are sparse, c&id{H) is of manageable size. Moreover, in many 
realistic situations, is bounded away from and so i^T = O(logn). For these cases. Algorithm [1] 
is scalable to very high dimensions. 

We conclude the section with a brief discussion on parallel implementation of Algorithm [TJ 
In the k-th iteration, both matrix multiplication and elementwise thresholding can be computed 
in parallel. For QR factorization, one needs only to communicate the rows of f with non-zero 
elements, the number of which is no greater than card(-ff). Thus, the overhead from communication 
is 0{mcaid{H)) for each iteration, and O (Km car d{H)) in total. When the leading eigenvectors 
are sparse, card(-ff) is manageable, and parallel computing of Algorithm [T] is feasible. 



5 Numerical Experiments 
5.1 Single Spike Settings 

We first consider single spike settings, where each observed data vector Xi is generated by (13. ip 
with fh = 1. Motivated by functional data with localized features, four different test vectors qi are 
considered, where qi = {f{l/p), . . . , /(p/p))', with / one of the four functions in Fig. [T] For each 
test vector, the dimension p = 2048, the number of observations n = 1024, and the spike value 
ranges in {100,25,10,5,2}. 

Before applying any sparse PCA method, we transform the observed data vectors into the 



wavelet domain using the Symmlet 8 basis 181] , and scale all the observations by a with given in 
()2.3p . The multi-resolution plots of wavelet coefficients of the test vectors are shown in Fig. [2l In 
the wavelet domain, the four vectors exhibits different levels of sparsity, with step the least sparse, 
and sing the most. 

Table [1] compares the average loss of subspace estimation by Algorithm [1] (ITSPCA) with 



14 



(b) Piecewise polynomial funclion 




(c) Three peak function 



°1 



(d) Single singularity function 



Figure 1: Four test vectors in the original domain: values at p = 2048 equispaced points on [0, 1] 
of four test functions, (a) step: step function, (b) poly: piecewise polynomial function, (c) peak: 
three-peak function, and (d) sing: single singularity function. 



several existing methods in the literature, including augmented sparse PCA (AUGSPCA) by Paul 
and Johnstone 24 1, correlation augmented sparse PCA (CORSPCA) by Nadler j2ol |. and diagonal 
thresholding sparse PCA (DTSPCA) given in Algorithm El For ITSPCA, we computed usmg 
Algorithm [2j The thresholds a„ and 7„i are specified by ()3.2p and (j3.3p with q = 3 and 7 = 
1.5. Moreover, we stopped the iteration once L(ran(QW),ran(Q('=+i))) < 10"^ To make fair 
comparison, parameters in the competing algorithms are all set to the values recommended by 
their authors. 

From Tabled! ITSPCA and CORSPCA outperform the other two methods for all test vectors 
and across different spike values. Between the two, CORSPCA only wins by small margin when 
the spike values are large. Otherwise, ITSPCA wins, sometimes with large margin. Moreover, at 
the same spike value and for the same algorithm, the sparser the signal, the smaller the estimation 
error. 

Table [T] also presents the average sizes of the sets of selected coordinates. While all the four 
methods yield sparse PC loadings, AUGSPCA and DTSPCA seem to select too few coordinates, 
and thus introduce too much bias. ITSPCA and CORSPCA apparently result in a better bias- 
variance tradeoff. 



5.2 Multiple Spike Settings 

Next, we simulated data vectors using model (j3.ip with m = 4. The qj vectors are taken to be the 
four test vectors used in single spike settings, in the same order as in Fig. [H up to orthonormal- 
izatior0. We tried four different configurations of the spike values [Xf, . . . , A|), as specified in the 
first column of Table [2l For each configuration of spike values, the dimension of data vectors is 
p = 2048, and the number of observations is n = 1024. 

^These four test vectors are carefully shifted such that the inner product of any pair is close to 0. Thus, the 
vectors after orthonormalization are visually indistinguishable from those in Fig. [T] 
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(a) DWT of step function (Symmlet 8). 



(c) DWT of 3-peak function (Symmlet 8). 



(b) DWT of piecewise polynomial function {Symmlet 8). 



(d) DWT of single singularity function (Symmlet 8). 



Figure 2: Discrete wavelet transform of the four test vectors in Fig. [TJ In each plot, the length 
of each stick is proportional to the magnitude of the Symmlet 8 wavelet coefficient at the given 
location and resolution level. 





ITSPCA 


AUGSPCA 


CORSPCA 


DTSPCA 


Test vector 




Loss Size 


Loss Size 


Loss Size 


Loss Size 


step 


100 
25 
10 

5 

2 


0.0061 114.2 
0.0224 76.3 
0.0470 53.4 
0.0786 45.5 
0.1921 25.4 


0.0096 96.5 
0.0362 55.4 
0.0710 37.4 
0.1370 23.7 
0.3107 11.4 


0.0055 120.1 
0.0236 73.9 
0.0551 45.9 
0.1119 28.7 
0.3846 15.2 


0.0275 66.6 
0.0777 38.3 
0.1494 24.1 
0.2203 17.1 
0.4518 9.7 


poly 


100 
25 
10 

5 

2 


0.0060 83.1 
0.0175 52.4 
0.0346 38.7 
0.0588 30.7 
0.1317 20.0 


0.0088 66.5 
0.0254 41.4 
0.0527 27.5 
0.0844 20.2 
0.2300 10.3 


0.0051 92.0 
0.0173 53.1 
0.0404 34.0 
0.0684 24.6 
0.2155 16.3 


0.0191 49.2 
0.0540 28.7 
0.0959 20.5 
0.1778 14.0 
0.3370 8.1 


peak 


100 
25 
10 

5 

2 


0.0019 45.7 
0.0071 34.1 
0.0158 28.0 
0.0283 24.7 
0.0927 20.8 


0.0032 39.6 
0.0099 29.9 
0.0222 23.8 
0.0449 19.6 
0.1887 9.9 


0.0016 51.2 
0.0069 35.2 
0.0165 27.3 
0.0320 22.5 
0.1176 14.6 


0.0075 32.8 
0.0226 24.3 
0.0592 18.6 
0.1161 14.1 
0.2702 8.8 


sing 


100 
25 
10 

5 

2 


0.0016 38.0 
0.0068 27.1 
0.0161 20.3 
0.0279 17.3 
0.0631 15.2 


0.0025 33.2 
0.0095 23.1 
0.0233 16.6 
0.0372 13.2 
0.0792 10.9 


0.0014 43.6 
0.0060 31.8 
0.0154 20.9 
0.0313 15.2 
0.0652 13.0 


0.0070 26.3 
0.0237 17.5 
0.0377 13.6 
0.0547 12.7 
0.2025 8.8 



Table 1: Comparison of sparse PC A methods in single spike settings: average loss in estimation 
and size of selected feature set. 
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(A^, A|, A3, A|) 


TO 


i(n„,-Pm) 


ITSPCA 


AUGSPCA 


CORSPCA 


DTSPCA 


(100,75,50,25) 


1 

2 
3 

4 


0.0216 
0.0180 
0.0094 
0.0087 


0.0260 
0.0213 
0.0129 
0.0122 


0.0240 
0.0214 
0.0126 
0.0181 


0.0378 
0.0308 
0.0234 
0.0235 


(60,55,50,45) 


1 
2 
3 

4 


0.3100 
0.2675 
0.1844 
0.0157 


0.2588 
0.2045 
0.1878 
0.0203 


0.2548 
0.2095 
0.1872 
0.0178 


0.2831 
0.2349 
0.1968 
0.0333 


(30,27,25,22) 


1 
2 
3 

4 


0.3290 
0.3147 
0.1740 
0.0270 


0.2464 
0.2655 
0.1662 
0.0342 


0.2495 
0.2882 
0.1708 
0.0338 


0.2937 
0.3218 
0.1821 
0.0573 


(30,20,10,5) 


1 
2 
3 

4 


0.0268 
0.0237 
0.0223 
0.0298 


0.0392 
0.0353 
0.0336 
0.0414 


0.0380 
0.0391 
0.0372 
0.0717 


0.0658 
0.0605 
0.0599 
0.0638 



Table 2: Comparison of sparse PCA methods in multiple spike settings: average loss in estimation. 

For each simulated dataset, we estimate Vm for m = 1,2,3, and 4. The last four columns of 
Table [2] present the losses in estimating subspaces, averaged over 100 runs, using the same sparse 
PCA methods as in single spike settings. For ITSPCA, we set the thresholds {7nj, J = 1, . . . , 4} as 
in (j3.3p with 7 = 1.5. All other implementation details are the same. Again, we used recommended 
values for parameters in all other competing methods. 

The simulation results reveal two interesting phenomena. First, when the spikes are relatively 
well separated (the first and the last blocks of Table [2]), all methods yield decent estimators of 
Vm for all values of m, which implies that the corresponding eigenvectors are also estimated well. 
In this case, ITSPCA always outperforms the other three competing methods. Second, when the 
spikes are not so well separated (the middle two blocks, with m = 1,2, or 3), no method leads to 
decent subspace estimator. On the other hand, all methods give reasonable estimators for V4,, for 
A| in both cases are well above 0. Here, ITSPCA again gives the smallest average loss. This implies 
that, under such settings, we fail to recover individual eigenvectors, but can still estimate V4 well. 

In summary, simulations under multiple spike settings not only demonstrate the competitiveness 
of Algorithm [H but also suggest: 

1. The quality of principal subspace estimation depends on the gap between successive eigen- 
values, in addition to the sparsity of eigenvectors; 

2. Focusing on individual eigenvectors can be misleading for the purpose of finding low-dimensional 
projections. 

6 Proof 

This section is dedicated to the proof of the main results in Section 13.41 We first summarize the 
main ideas in Section [6.11 and divide the proof into three major steps, which are then completed in 
sequel in Sections I6.2H6.4I 
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6.1 Main Ideas and Outline of Proof 



The proof is based on an oracle sequence approach, the main ideas of which are as follows. First, 
assuming oracle knowledge of the high signal feature set H, we construct a sequence of p x m 
orthonormal matrices {Q^^^'°,k > 0}. Then, for this sequence, we study how fast it converges, 
and how well each associated column subspace approximates the principal subspace Vm of interest. 
Finally, we show that, with high probability, this oracle sequence is exactly the sequence 
0} obtained by Algorithm [TJ Therefore, the actual estimating sequence inherits from the oracle 
sequence various properties in terms of estimation error and number of steps needed to achieve the 
desired error rate. Here, the actual sequence mimics the oracle one because the thresholding step 
forces it to only consider the high signal coordinates. 

In what follows, we first construct the oracle sequence and then lay out a road map of the proof. 
Here and after, we use an extra superscript "o" to indicate oracle quantities. For example, Q^^^'° 
denotes the A;-th orthonormal matrix in the oracle sequence. 



Construction of The Oracle Sequence 

First, we construct 

Q(0),o 

using an oracle version of Algorithm [2l where the set B is replaced by its 
oracle version B° = B n H. This ensures that Q^^^'" = 0. Here, q„ is given by ()3.2p . 

To construct the rest of the sequence, suppose that the p features are organized (after reordering) 
in such a way that those in H always have smaller indices than those in L, and that within H, 
those in B° precede those not. Define the oracle sample covariance matrix 



5° 



Shh ■ 
III 



(6.1) 



Here, III is the identity matrix of dimension card(L). Then, the matrices {Q^^^'°,k > 0} are 
obtained via an oracle version of Algorithm [H in which the initial matrix is Q^^^'°, and S is 
replaced by 5°. Here, the 7nj's are specified as in (13. 3p . 

Remark 6.1. The above formal construction does not guarantee that the Q^'^^'" matrices have full 
column rank or that = for all k. Later, we show in Lemma [6.3l Proposition 16. II and Lemma 

that these statements are true with high probability for all k < 2K. 



Major Steps of The Proof 

We first introduce some notation. In the k-th iteration of the oracle Algorithm [H denote the 
matrices obtained after the multiplication and the thresholding steps by 

rp{k),o ^ 50g{fc-l),o ^ (^W,o^^ 

f(')- = (iif") Withti;)- = ,(t[5.)'°,7n,). ^'"'^ 

Moreover, denote the QR factorization of f C')'" by f C')'" = Last but not least, let 

Vil:^'° = ran(Q('=)'°). 

With the oracle sequence, a joint proof of Theorems 13.21 and 13.31 can be completed by the 
following three major steps: 
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1. Show that the principal subspace of S" with dimension m, denoted by V^, satisfies the error 
bound in ()3.8p for estimating Vm] 

2. Show that for K in (13. Oh . after K steps, the approximation error of Vm^'" to also satisfies 



the bound in (jS.Sp : 

3. Show that the oracle sequence satisfies = for all k < 2K, and that the oracle sequence 

and the actual estimating sequence are identical up to 2K iterations. 

In each step, we only need the result to hold with high probability. By the triangle inequality, 
steps 1 and 2 imply that the error of Vm^'" in estimating Vm satisfies the bound in (13. 8p . Step 3 
further shows this is also the case for the actual estimator Vm\ In addition, it implies the correct 
exclusion property. 

In what follows, we complete the three steps in Sections I6.2H6.4I 

6.2 The Principal Subspace of 5"° 

To study how well the principal subspace of 5° approximates Vm, we divide into a "bias" part and 
a "variance" part. 

Consider the "bias" part first. Define the oracle covariance matrix 



^HH 

III 



(6.3) 



which is the expected value of 5*°. The following lemma gives the error of the principal subspace 
of S° in approximating Vm, which could be regarded as the "squared bias" induced by feature 
selection. 

Lemma 6.1. Let the leading eigenvalues of T,° be £° > ■■■ > 1°^, and {g°,...,g5^} he a set of 
corresponding eigenvectors. Denote Q° = [q°, . . . , Qm]- Then, uniformly over J-n, 

(1) \£° - {\] + l)|/Af ^ as n ^ oo, /or i = 1, ... , fh; 

(2) For sufficiently large n, Qf, = 0, and for Vm = ran((5°), there exists a constant C = 
C{m, r, k), s.t. 



L(Vm,V^)<CMnT, 



2 



A proof is given in the appendix. Weyl's theorem [30|, Corollary 4.4.10] and Davis-Kahn's sin 9 
theorem 0] play the key roles in the proof here, and also in those for Lemmas 16.21 and 16.31 

Turn to the "variance" part. We look at how well the principal subspace of S° estimates that 
of S°. Since S° = E[S'°], the error here is analogous to "variance". 

Lemma 6.2. Let the leading eigenvalues of S° be 1° > ■■■ > i'^, and {q^, . . . ,q^} be a set of 
corresponding eigenvectors. Denote Q° = Then, uniformly over Tn, with probability 

at least 1 — Cop~'^, 

(1) -^0 asn^ oo, for j = 1,.. . ,m; 
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(2) For sufficiently large n, Q^. = 0, and for = ran((5°), there exist constants Ci = 
Ci(m,r, k) and C2, s.t. 

nm- 

A proof is given in the appendix. 

By the triangle inequahty, the above two lemmas imply that the error in estimating Vm with 
satisfies the bound in (13. Sp . which completes step 1. 



6.3 Properties of The Oracle Sequence 

In step 2, we investigate the properties of the oracle sequence. The goal is to show that, with high 
probability, for all k > K, the error of the oracle subspace estimator Vm in approximating 



satisfies the bound in (j3.8p . To this end, characterization of the evolution of the oracle sequence in 
Proposition 16. II plavs the key role. 

The Initial Point 

We start with the initial point of the oracle sequence. The following lemma shows that Q^^^'° is 
orthonormal, and is a good initial point for (oracle) Algorithm [TJ 

Lemma 6.3. Uniformly over J^n, with probability at least 1 — Cop~^, 

(1) B° = B; 

(2) \lj{SBoB°) - §|/A? ^ as n ^ 00, /or j = 1, . . . , m; 

(3) For sufficiently large n, Q^^^'° has full column rank, and L{V^,V^'°) < (1 — /5)^/5. 

A proof is given in the appendix. Claims (1) and (2) here, together with Lemmas 16.11 and 16. 2^ 
also imply that £j(5b_b)/(A^ + 1) — >• 1 as n 00. So, the £j{SBB)^s are good estimators of the 
leading eigenvalues of S. 

Evolution of The Oracle Sequence 

Next, we study the evolution of the oracle sequence. Let 9^^^ G [0,7r/2] be the largest canonical 
angle between the subspaces and Vm^'° . By the discussion after (\2.lh . we have 

sin2 0('=) = L(p;^,pW'°). (6.4) 

In addition, let 

P = C+i/C (6.5) 

denote the ratio between the (m + l)-th and the m-th largest eigenvalues of S°. The following 
proposition describes the evolution of 9^'^^ over iterations. 

Proposition 6.1. Let n be sufficiently large. On the event such that the conclusions of Lemmas 
\6.1l\6.3\ hold, uniformly over Tn, for all k > 1, 
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(1) Q^^^'° is orthonormal, and 6^^^ satisfies 

sin + u) sec 9^^-^\ (6.6) 

where a; = (C)"i [card(/7) EJLi 7^,] '^'z 
("S; For any a G (0,1/2], // 

sin^ ^('^-i) < 1.01(1 - a)-2w2(i _ ^)-2^ ^g_7) 

then so is siv? 9^^\ Otherwise, 

sin^ sin2 ^C^-i) < [1 - a(l - p)f . (6.8) 

A proof is given in the appendix, the key ingredient of which is Wedin's sin0 theorem for 
singular subspaces [sH]. 

The recursive inequaUty (j6.6p characterizes the evolution of the angles 6'^^\ and hence of the 
oracle subspace Vm^'° . It is the foundation of claim (2) in the current proposition and of Proposition 
[O] below. 

By (16. 4j) . the inequality (j6.8p gives the rate at which the approximation error L(V^,Vm^'°) 
decreases. For a given a € (0, 1/2], the decrease rate is maintained until the error becomes smaller 
than 1.01(1 — a)~'^uj'^{l — p)~^. Then the error continues to decrease, but at a slower rate, say, 
with a replaced by a/2 in (16. Sp . until (16. 7p is satisfied with a replaced by a/2. The decrease 
continues at slower and slower rate in this fashion until the approximation error falls into the 
interval [0, 1.01wV(l - pf], and remains inside thereafter. 

Remark 6.2. Together with Lemma 16.31 Proposition 16.11 also justifies the previous claim that ele- 
ments of the oracle sequence are orthonormal with high probability. 



Convergence 

Finally, we study how fast the oracle sequence converges to a stable subspace estimator, and how 
good this estimator is. 

To define convergence of the subspace sequence {Vm'", k > 0}, we first note that IMco^ /{l-pf 
is almost the smallest possible value of L(V^,Vm^'°) that (16. 6p could imply. Indeed, when sin 
converges and is small, we have sin^^''^^ ~ sm6^^^^\ and cosO^'^^ ~ 1. Consequently, ()6.6p reduces 
to 

sin 6^''^ < (psin0(^')+a;)(l + o(l)). 

So, L{V^,V^^'°) = sin'^e^''^ < {1 + o{l)) oj"^ / {I - . In addition. Lemma E2] suggests that we can 
stop the iteration as soon as L[Vj^,Vm^'") becomes smaller than a constant multiple of e^„^, for we 
always get an error of order 0{e^m) estimating Vrn, even if we use directly. In observation 
of both aspects, we say that Vm^'° has converged i^ 

LiV^n^n < (13^(137)2' 'Q- (6-9) 

Under this definition, the following proposition shows that it takes K iterations for the oracle 
sequence to converge, and for all k > K, the error of approximating "P^ by Vm^'° satisfies (|3.8p . 

^Here, the introduction of the multiplier (1 — n~^)~^ is only for mathematical convenience. 
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Proposition 6.2. Let K he specified in (j3.9p . For sufficiently large n, on the event such that the 
conclusions of Lemmas 1 6. 1\\6. 3l hold, uniformly over Tn, it takes at most K steps for the oracle 
sequence to converge. In addition, there exist constants Ci = 6*1(7, r,m,K) '^^^ ^2, such that for 
all k>K, 

snpL{V^,Vi^'°) < CiMnrl^ + Cae^- (6-10) 
A proof is given in the appendix, and tliis completes step 2. 

6.4 Proof of Main Results 

Here we come back to prove tlie properties of tlie actual estimating sequence. The proof relies on 
the following lemma, which establishes the equivalence of the actual sequence to the oracle one up 
to 2K iterations. 

Lemma 6.4. Let /3 = cj ^/rn in (|3.5p . There exist a constant 70 = 70(c), such that if we set 'jnj 
as in (13. 3j) for some 7 > 70, then for sufficiently large n, with probability at least 1 — Cop~'^, for 
k = 0,l,...,2K, we have Q^^'^'° = 0, Q^^^ = QW'°, and hence vln^ = M^'". 

A proof is given in the appendix, and this completes step 3. 

We now prove Theorems 13.21 and 13.31 by showing that the actual sequence inherits the desired 
properties from the oracle sequence. Since Theorem 13.11 is a special case of I3.2( we do not give a 
separate proof. 

Proof of Theorem \3.2l Note that the event on which the conclusions of Lemmas I6.1fl6.4l all hold 
has probability at least 1 — Cop~'^. On this event, we have, for k = K, . . . , 2K, 

LiV„^,V^^) = LiV„^,V^^n 

< [Ly\v^,vi,) + L''\v"rnm + L''\v°^M'^n]' 

< C[L(P^,P°,) + L(p;^,P°,) + L(p;^,p(,^)'°)] 

Here, the first equality comes from Lemma |6.4[ The first two inequalities result from the triangle 
inequality and Jensen's inequality, respectively. Finally, the last inequality is obtained by replacing 
all the error terms by their corresponding bounds in Lemmas 16.11 16.21 and Proposition 16.21 □ 

Proof of Theorem \3.3[ Again, we consider the event on which the conclusion of Lemmas 16. 1H6.4I all 
hold. Then Lemma 16.41 directly leads to the conclusion that = Q^^^'" = 0. □ 

We conclude the section by proving Corollaries 13.11 and 13.21 

Proof of Corollaru \3.1[ Under conditions GR, SP and AD(m, k), we have = o{t^,^ V e,^^), and 
so = o(CiM„r^^ + C2e^^). In addition, we note that the loss function (|2.1|) is always bounded 
above by 1. Let E denote the event on which the conclusions of Lemmas I6.1H6.41 hold. Then we 
have 



EL{Vm,V^J:^) = EL{Vm,H'%E)+m'Pm,ri%E^ 



< supL(P„, P(^))P(E) + P(E^) < CiMnT^^ + CseL- 

E 

This completes the proof. □ 
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Proof of Corollary Let Q^^'^ = [q^\ ■ ■ ■ ,qm^]- By the construction of Algorithm [H for any 
/ < m, span{(ii'\ . . . ,5/'^^} = vl''\ where Vi''^ is the estimating subspace obtained after the A;-th 
iteration of Algorithm [1] with target rank I. Therefore, when AD(/, k) holds for I = j — Theorem 
13.21 leads to 

LiVuVi"^) < CiMr^rli + C2eli, l=j- (6.11) 

For / = j — write the projection matrices associated with Vi and "P^^ as Pi and Pi^\ respectively. 

Then, the projection matrices onto span{gj} and spanjg^'^^} are Pj — Pj-i and Pj^^ — Pj'^i- This, 
together with (j2.ip . implies 

L(span{g,},span{g;.^)}) = \\{P, - - {P^''> - P^"},)]]' 

= \m - pj'^) - ip,~i - pi-\)f 

< 2\\Pj - Pj^^f + 2||P,-_i - 

= 2L{Vj,vf'^) + 2L{Vj-i,vf\) 
<C7iM„r2.+C2(e2,_iVe2.). 

The last inequality comes from ()6.1ip and the fact that r^^ > t^j-i- □ 

Acknowledgment 

The author would like to thank Professor Iain Johnstone for many helpful discussions. 

A Proof of Intermediate Results 

This appendix gives proofs of Lemmas I6.1H6.4I and Propositions 16.11 and 16.21 To facilitate later 
discussion, we note that the data matrix Xnxp admits the representation 

X = VAQ' + Z. (A.l) 

Here, Vnxm = (%) has i.i.d. iV(0, 1) entries, Amxm = diag(Ai, . . . , A^), Qpxfh = [qi,---,qfh] is 
orthonormal, and Znxp = i^iu) has i.i.d. A^(0, 1) entries. 

A.l Proof of Lemma 16.11 

We use Weyl's theorem (Corollary 4.4.10 in [i^]) to prove claim (1), and Davis-Kahn's sinO theorem 
(Theorem IB. ip to prove claim (2). In addition, the following lemma is useful. 

Lemma A.l. The quantities \\qjL\\2 ^'^^ card(ff) satisfy the following bounds: 

(1) hjLg < [2^s,"(/3t„,)-" Ap](/3r„,f , i = 1, . . . ,m; 

(2) When n is sufficiently large, fh < card(if) < CMn, where C = C{m,r). 
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Proof. For claim (1), we apply the "ideal risk calculation trick" in jlj]. Let tj = tj{n) be the 
solution of Sjt~^/'^ = /3r„j. We have 



On the other hand, by the definition of L, we always have the bound H^jxlll — card(L)-max^gi{g3j} < 
p(/3T„j)^. Recall the definition of M„, we complete the proof of (1). 

For claim (2), we first show that card(i7) < CM„. To this end, for j = 1, . . . , m, let 

= {f : \qvj\ > I^Tnj}- 

Note that H = UjLi-f^j- So, card(ff) < Z^^li card(-ffj). For each Hj, the weak-^^ constraint 
implies that card(Fj) < s^(/3t„j)-^ So, card(i/) < p/\J2f=i s^il^^nj)"' < CMn- Since /? = c/^, 
we get C = C{m,r). 

To show card(-ff) > m, it suffices to show that Qh- has full column rank. Note that Q is 
orthonormal. So, this is equivalent to showing ||(5l.|| < 1. For claim (1) implies that 



WQLf <\\QL.\\l = Y.h3L\\l<Y. 



{liTnjY = 0{l). 



Here, the last equality comes from condition SP. So, when n is large, ||Ql-|| < which completes 
the proof of (2). □ 

Proof of claim (1) Weyl's theorem states that |£° - (A| + 1)| < - for all j. Note that 
J? , "^^^^ , where T^hl = Qh-^^Q'l. and I^ll - III = QlA^Q'l.- Thus, 

-2^LH iLL - ^LL\ 

— S|| < 2||S/fi|| + — Ill\\- Note that ||(?j_H'||2, Ikjilb < 1, and that Lemma fA .11 together 
with condition SP, implies that ^JLi Ikjilb = o{l). Thus, we obtain 

m 

- Ill\\ <\V^ Ikjxib = o{\\). 
Therefore, \£° - (A^ + 1)| < - S|| = o(Af), for j = 1, . . . ,?fi. 

Proof of claim (2) First, claim (1) implies that, when n is large, C > A^ + l-o(Af) > 1. Thus, 
the m leading eigenvectors of E° could be constructed by first taking the m leading eigenvectors of 
'^HH, and then augmenting all the coordinates in L with zeros. Consequently, Q^, = 0, 

To further obtain the bound on L{Vrm'Pm)y we apply Davis-Kahn's sin0 theorem (Theorem 
lBT]l . In Theorem EH let ^ = S, ^ + ^ = S°, Go = Q, and Fq = Q° . Since Go is orthonormal, 
the theorem states that 

^o.^ ii(Qgy(s°-s)Qf 
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where Q° is a p x (p — m) orthonormal matrix whose columns are orthogonal to those in Q° . The 
subscript 'c' is used to indicate the range of Q° is the orthogonal complement to that of Q° . 

Consider the right side of (|A.2p . For the denominator, claim (1) and condition AD(m, k) imply 

+ 1) - C+il > - A^+i - o{\l) = {\l - + o(l)). (A.3) 

For the numerator, since Q° is orthonormal, we can bound it as ||((5c)'(5^° — ^)QII — II ~ ^)QII- 
Furthermore, we have 



?1 , Tp2 



where Eq^h = -Qh-A^Q'l.Ql-, E^^^ = -Ql K^Q'h.Qh- and = -Ql-^^Q'l.Ql- Thus, we 

obtain - < ||£^o,h|| + II-E^olII + II^^olII- ^^^t follows, we bound each of the three 

terms on the right side. Take E^ ^ for example. Let 

Aq = diag(Ai, . . . , A^), A^ = diag(A^+i, . . . , A^), Qi = [g^+i, • • • ^m]- (A.4) 

Then -El^ = Ql-A-IQ'h-Qh- + Qi,l-AIQ[ h.Qh- = Ql-A^Q'h.Qh- - Qi,l-AIQ[ l.Ql: Here, the 
second equality comes from the identity Q[ j^.Qh- + Q'l l-Ql- = Q'lQ = 0. So, 

II^lII < ||gL.||||A^||||Q/f.f + ||QL.||||A?||||Qi,i.f 

<||QL.||||A2||(l + o(l))<A?||Qi.||F(l + o(l)). 
Similar arguments show that HE'c^hH, II^olII ~ o(A^||(5l.||f). Thus, we obtain that 

||(Q°)'(S° - < - < A?||Qi.||F(l + o(l)). (A.5) 

By dA^, (1X3]) . we obtain that 

X^WDr l|2 / X2 \ 2 m 

V^m "^m+l) ^m+1 / j=i 

Since Af/(A^ — X^+i) < k under condition AD(m,K), we complete the proof by bounding the 
Ikji Ill's using Lemma [A. 1[ 

A. 2 Proof of Lemma 16.21 

We prove Lemma 16.21 by a similar approach to that for Lemma l6.ll The major difference here is 
that we are dealing with a random matrix S° instead of a deterministic one. 

Many of the bounds to appear in the proof involve the following two quantities: 



C(n,/) = 2J1 + 1+6J^1^^, (A.6) 
V n n V n 



/, m;b) = Jl + 1^ Ji + + b.M^ ] . (A.7) 



n \ \ n y n V n 



Here, l,m,n and 6 > 0. By Proposition lB.il ({n, I) gives a probabilistic bound on the spectral 
norm of the difference between a random (scaled) Wishart matrix A and its mean I. On the other 
hand. Proposition IB.2I shows that ^{n,l,m;b) is a probabilistic bound on the spectral norm of 
Y'Z/n, where Y and Z are independent n x I and n x m matrices with i.i.d. A^(0, 1) entries. For 
controlling these quantities, the following lemma will be used multiple times. 
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Lemma A. 2. There exist constants Ci(/3, r, k), C((/3, r, k), C2 and C2, such that when n is large, 



C(n, card(ii)) < 61 — -j^=^^ + C2I 



\/log(Pn) V n 



V Af + 1 log p„ V n 



Proof. By Lemma lA.ll we have 



card(i^) CM„ CM^ 2 



n/i(A2,) (A2, + 1) log(p„) - " ""^ (Af + 1) log(p„,) • 
Here, the second inequahty comes from condition AD(m, k). Therefore, we obtain 

C(?i,card i7 < , ^ ^ + CMn7;„^ + C21 

Vlog(pn) (Ai + 1) log(Pn) 



n 



n 



\/log(p„) V (A2 + l)^log(p„) 
AfMy /log(p„) 
Vlog(Pn) V n 

The bound on ^(n, m, card(i^); 2) follows from similar arguments. □ 

and we decompose 



To facilitate the arguments, note that the difference 5*° — S° 
Ehh as Ehh = Ylk=i ^k, where 



Ehh 




El = QhA (-V'V - l) AQ'h., E2 = -Z'hZ.h - Ihh, 

^ (A.8) 

^3 = - {QhAV'Z.h + Z'.hVAQ'h.) . 
n 

Proof of claim (1) As before, Weyl's theorem implies — < — = ||£'//iif || < 

X]fc=i ll-^fcll- what follows, we derive bounds for each of the ||i?fc||'s. 
Consider Ei first. Since HQ/f-H < 1 and ||A|| < Ai, on the event 

\\\-V'V - III <C(n,m)], (A.9) 
^ n ' 

we obtain ||i?i|| < HQ/f-lPHAlPHiyF — /|| < \\C{n,m). Next, for £'2, we immediately get the 
bound ll-E'211 < C(^) card(ff)) on the event 

{W-Z'.^Z.h-IhhW < C(n,card(//))}. (A.IO) 
Finally, for £^3, since HQh-II ^ 1 and ||A|| < Ai, on the event 

{IIV^'^.hII < n^(n,m,card(if);2)}, (A.ll) 
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we have ||i?3|| < ^||(5//.||||A||||F'Z/f.|| < 2Ai^(ri, m, card(ff); 2). Combine these bounds and apply 
Lemma |A.2[ We obtain 



Here, the last equality comes from conditions GR and SP. Furthermore, observe that Propositions 
El] and EJ] imply the events in (jX9]) . ([ATTO]) and fATT]) occur with probability at least 1 — Cop„ , 
and so does the intersection of them. This completes the proof of claim (1). 

Proof of claim (2) We use Theorem IB.ll again. In the theorem, let A = S°, A + E = S°, 

Go = Q° , Fq = Q°. Then Fi = Q", where Q° is a,px (p — m) orthonormal matrices whose columns 
are orthogonal to those in Q° . As before, the subscript 'c' is used to indicate the range of Q° is the 
orthogonal complement to that of Q°. With this setup. Theorem IB.ll asserts that 

UV V° ) < f A 12) 

Consider the right side of ()A.12p . For the denominator, claim (1), Lemma l6.ll and condition 
AD(m, k) imply that \im-^m+i\ > -^m - -^m+i ~ = (-^m - ^m+i)(l + For the numerator, 

(jA.SP leads to the decomposition: 

3 3 

(Q°)'(5° - = {QlH.yEHHQ%. = Y.^QlH.)'EkQ%. = E ^1 (A.13) 

fe=l k=l 

Here, Q° ^, denotes the submatrix of Q° that contains all its rows in H . Thus, the triangle inequality 
leads to an upper bound of the numerator if we bound each ll-E^II, which we carry out below. 

For El, note that E^ = {QIh.)'Qh-^{IV'V - l) ^Q'h.Q%.- Since C(n,m) < C^\og{pn)/n, 
||A|| < Ai and \\Qh-\\, \\Q°h.\\ < 1, on the event (IXOl) . we have 

\\E\\\ < \\{QiHyQHM\\-v'v-i\\\mQHMQu 

n 



<C7Ai^i^||(Q°^.)'QH.A||. 

To further control the last term on the right side, we apply an "orthogonal decomposition" trick. 
Observe that Q^. is a card(i7) x m orthonormal matrix, because Q° is p x m orthonormal with 
= 0. Define as a ca,id{H) x (card(-ff) — m) orthonormal matrix whose columns are orthogonal 
to those in Q'jj., we obtain the decomposition 

lHH = QUQH.y + PHiPHy- (A.14) 

Thus (Q°^.)'Qh.A = {QlHyQjjXQ^yQH-A + iQljj.yP^iP^yQH-A. The first term on the right 
side can be bounded as 

\\{Qc,HyQHXQ°H.yQH.A\\ < Ai WiQlH.yQH-W = Ai||(Q°)'Q°||. 
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Here, the first inequahty holds because ||, ||Q_f/.|| < 1 and ||A|| < Ai, and the last equality 
comes from the observation that Q^. = 0. By (IA.4p . the second term in the last display could be 
further decomposed as (Q°j^.)'P|^(PH)'<3^^-^o + (Q°iy.)'^'H(i^H)'Qi,H.Ai. Note that ||(P^)'Q//.|| < 
ll(Qc)'QII) ll^oll < ll^ill ^ Am+i, and the spectrum norms of all the other matrices are uniformly 
bounded above by 1. Thus, we obtain that 

miH.yPHiPHYQH.AW < Ai||(g°)'Q|| + A^+i. 



Combining the parts leads to 



WiQlH.YQHM < A™+i + Ai ||(Q°)'Q°|| + ||(Q°)'g|| 



(A.15) 



which in turn implies \\E^\\ < C [\iXm+i + Af + ||(Q°)'Q|| 

Turn to £'2- (|A.14p leads the decomposition E2 = -E21 + -^221 where 



n 



^21 - {Q°c,H-)'Q°H- 



n 



-{Z.HQ%)'{Z.HQ%)-Ir, 



n 



For E^-^, recah that Ql, = 0. Thus, || II = ll(Q°)'Q°ll- So, on the event 

{\\-{Z.HQ°H.nZ.HQ°H.) - Im.\\ < C(n,m)}, 



(A.16) 



we obtain \\E^^\\ < ||(Qc)'<3°IIC(^, H < C||(Q°)'Q°|| v^log(p„)/n. For E^^^ on the event 

{\\{Z.HP^yZ.HQ"H.\\<n^{n,m,cavd{Hy,2)}, (A.17) 

we obtain ||-E'22ll — ^(^^5 ^) card(i/); 2) < ^(n, ?7i, card(-ff); 2). Combine the two parts and apply 
Lemma |A.2[ We obtain 



11^211 < ^l^TTT 



+ C2 1 + II Q°Q^ 



V(A^ + l)log(p„) 
For E^, (IXSj) and (lAl3l) lead to E^ = E^^ + ^^2^ where 



log(p„,) 



n 



^31 = -iQlHyQH.AV'Z.HQ%., EI2 = ^(Q- yz'.HVAQ'H.Qh 



For E^i, on the event 



(fXT5]) and (fXT]) imply 



{\\rz.HQ°H.\\<na 



n, m, m; 



2)}, 



1 



l|i^3"ill<-ll(Qc,H.)'0H.A||||v^'^.//Q?f. 



n 



< 



Xm+i + Ai (II (Q°)'Q°|| + II {Q°yQ\\) C{n, m, m; 2) 

<c A^+i + Aifii(Q°yQ°ii + ii(g°yQii 



log(Pn) 



(A.18) 



n 
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In addition, on the event (jA.lip . we obtain ||i?32|| < Ai^(?i, m, card(ff); 2). Then the triangle 
inequahty and Lemma lA.21 lead to 



A3m^/V r / ^ M / 

ElW < C^^^=^=^= + C2(Ai + A™+i) + C3Ai(||(Q°)'Q°|| + ||(Q°)'Q||) J 
A/iAr + 1) loelo„ I L \ /J v 



V(A^ + l)log(p„) 
Summing up the above bounds for the H-E^H's leads to 

(-^m - + 0(1)) 



'log(p^ 



n 



<Ci 



\l (Ai + 1)mI'\, 



nm 



- ^m,+l \/(A^ + l)log(p„) 



Am - A,„+i V n V 

Condition AD(7ti, k) implies A^/(A^ — A^^_,_|) < n. We collect terms and apply Jensen's inequality 
to obtain 



log(p„) (A2,-A2 )2 



Since Z.hQ%. is a n x m matrix with i.i.d. A^(0, 1) entries, independent of the matrices Z.nPfj 
and V which also have i.i.d. A^(0, 1) entries. Propositions IB . Il and IB. 2] thus imply that the events in 
()A.16p . (|A.17p and (jA.lSP occur with probability at least 1 — Cop^^, and so does the intersection 
of them. This completes the proof. 

A. 3 Proof of Lemma 16.31 

We first introduce some preliminary results. The first result shows that, with high probability, the 



set B selects all the "big" coefficients in the leading eigenvectors up to order 0( Y'log(p„)/(nA|) ) . 
Indeed, for < a_ < 1 < a+, define 



O / x2 2 ^ / log(pn) ) 

.7 = 1 



Lemma A. 3. Let B± he defined as above. For appropriately chosen a and a=p, when n is sufficiently 
large, with probability at least 1 — Cop^^, B^ C B C C H. 

Proof. Note that S^u ~ crlXn/n, where al = 1 + Y^]Li ^'jQlj- Consider the event {B_ C B}. We 
have 

P{S_ ^ B} = F{u,eBASuu < l + aVlog(p„)/n}} < ^ F{S,, < 1 + ay'log{pn)/n} 

< \^ < 1 + Q\/log"(Pn)/"- \ < p _ i< (a+ - l)^a^ log(Pn) 

~ .7^- 1 " 1 + a+aVlog(Pn)/n i ~ " 4(1 + a+ay^log{pn)/n)^ 

< „l-{a+-l)2a2(l-o{l))/4 
— Fn 
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Here, the last inequality comes from Lemma |B.2[ On the other hand, we have 

og(p„)/n} 

i^eB'^ I ~ ^ + a-av^log(p„)/?i j ~ \ n ~ 1 + a_a v^log(pn)/n 



< 



\/2(l + a-ay^log{pn)/n) 

< I is _|. I I — exp 

(1 - a_)ay^log(p„) 

< l-{l-a_)2a2(l-o(l))/4 



(1 - a_)^a^ log(Pn) 
4(1 + a_aY^log(p„)/n)2 



Here, the second last inequality comes from Lemma lB.21 From the above bounds, if we choose a 
and a± properly, then i?_ C B C B^ holds with probability at least 1 — Cop~^. 

To show i?+ C H, observe that for any u £ S+, there exists some j G {1, . . . , m}, s.t. Xjqlj > 

^W^V °^n'"^ • ^'^^ definition of H, it then suffices to show that for large enough n, 



a-«./log(Pn)^^.^.^_^. A^. + llog(p„) ^ 



n 



\4 



The inequality holds because, under conditions GR and SP, both log(p„)/(nA^) and log(p„)/n 
converge to as n — )• oo. □ 

Lemma A.4. Let el = Ef=i 55[log(Pn)/(nA4)] V2-r/4_ j.^^^^ ||q^^||2 ^ X:f=i IkjBl Hi < Cel 
and A^*Y^card(ff)/n, A|f*A/log(p„)/n = o(e„,) /or i = 1, 2. 

Proof. The bound on ||Q._b= ||p is obtained by following the lines of the proof of Lemma lA.H but 
with H replaced by In addition, conditions GR and SP, together with Lemma lA.H imply the 
bounds on X^'' y^cavd{H) / n and Aj"*-y/log(p„)/?i. □ 

We now prove Lemma 16.31 Throughout, we restrict our attention to the high probability event 
on which the conclusion of Lemma lA.31 holds. Let M = H\B°, and define 



CIO 

•^0 



So,HH 
III 



, with So,HH 



Sb°b° 
Imm 



(A.19) 



Proof of claim (1) First, claim (1) is a direct consequence of Lemma lA.31 Recall that B° = 
BnH. On the event such that the conclusion of Lemma [A3l holds, we get B C H, and so B" = B. 



Proof of claim (2) Here, we first apply Weyl's theorem to obtain 

l^ji^o) ~ ^11 - 1 1 5"° — 5'o II = \\Shh ~ So hhW ^ '^\\Smb° II + ||5'j\/A/ — /mm||- 



(A.20) 



To bound HS'mboII, observe that Smb° = "^{=1^11 where 

Di = -Qm.AV'VAQ'b.., D2 = -Z:^jZ.bo, D3 = -Qm-AV Z.bo, D4 = -Z'.^jVAQ'bo.. 

n n n n 
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Note that ||Qbo.|| < 1 and that M C Bl. So, on event ()A.9p . Lemma EH leads to 

\\Di\\ < \\A\\^-V'V\\\\Qm-\\f < XI\\-V'V\\\\Qb^ .\\f <CXlen. 
n n 

For let Z = [Z.i, . . . , Z.p], with Z.y = Z.jj/\\Z.y\\2. Thus, D2 satisfies 

II-D2II < — max ||Z.jy||2 • \\Z\,Z.B°\\- 
n ueH 

Note that \\Z.,y\\2 ~i.i.d. Xn- Lemma [B.2l ensures that with probability at least l — Cop'"^, ^ max^^g// \\Z.u 
1 + 0(1). On the other hand, since B° only depends on {||Z.^||2,z^ G H}, which are independent 
of {Z.y,^ G H}, we obtain that Z.h is independent of B". So, Z.m is independent of Z.b°- Let 
U = diag(ni, . . . ,Ucard(B°)) with uj i.i.d. Xn random variables. Then, Z'j^Z.boU has i.i.d. A^(0, 1) 
entries, and so Lemma lB.31 asserts that with probability at least 1 — Cop^'^, 



\\Z'^jZ.BoU\\ < C(^card(M) + v/card(S°) + y^log{pn) ) < C7( v/card(F) + Vlog(p„) ) . 

In addition, Lemma lB.21 shows that with probability at least 1 — Cop~'^, miujUj = — o(l)). 
Thus, we obtain that, with probability at least 1 — Cop~'^, 



\ZM < ^ii^ < C + J'^ = o(Af.„) 



n \ n 



Here, the last inequality comes from Lemma |A.4[ Thus, we obtain II-D2II = o{en)- For Di, and 1)4, 
on the event (jA.lip , Lemmas IA.2I and IA.4I lead to 

Pall < -||A||||y'Z.Bo||||gM.||F < -AiIIF'Z.hIIIIOb^.IIf < \ii{n,m,c^Td{H)-2)\\QB^ -h = o{\len). 

n n ~ 

II-D4II < i||A||||Z.Vl^||||QBo.|| < ^Ai||Z.'^y|| < Ai^(n,m,card(//);2) = o{\\en). 

Combining the four parts leads to ||-S'j\/b°|| < C\\en- 

Switch to Smm — hiM- We decompose it as Smm — Him = Yl\=i EMM,i, where 

Emma = Qm-^ ( —V'V — I] AQ'm-, Emm,2 = —Z'^Z.m - Imm, 
\n J n 

Emm,3 = - {Qm-AV'Zm + Z'mVAQ'm.) . 

Consider the intersection of the events in (|A.9P , (jA.lOp and (jA.lip . Lemmas IA.2I and IA.4I lead to 

pMM,i|| < A?||-yV||||gAf.||| < C\1\\Qb^.\\1 = o(A?e„), 
n 

||^MM,2|| < W-Z'.fjZ.H - IhhW < C(?^,card(F)) = o(A^e„), 
n 

\\Emm,3\\ < -Ai||y'Z.i^||||QM.||F < CAi^(n,m,card(F);2)||QBc.||F = o(A?e„). 
n 

Thus, we combine the three parts to obtain HS'a/a/ — /a/m|| = o(Afe„). 

Condition SP implies that en = o(l). Thus, ()A.20p . together with the bounds on ||5a/_bo|| and 
II'S'mm — -^mmII) leads to |£j(5q) — i°\ = o(A^), for j = 1, . . . , in. In addition, condition AD(m, k) 
implies that > A^/k. Thus, the last display also implies that ijiS^) = ij{SB°B°) > 1 for 
j = 1, . . . ,m. This completes the proof claim (2). 
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Proof of claim (3) Turn to claim (3). Since £j{S^) = £j{SBoB°) > 1 for j = 1, . . . , m, V^a^'" 
is the leading principal subspace of Sq. Thus, we apply Theorem IB. II with A = S°, A + E = Sq 
and 5 = £m{S^) - im+i{S°). Claim (1) and Lemma ED imply that 5 = (A^ - A^+i)(l + o(l)). 
Therefore, Theorem IB . 1 1 leads to 

L(V° < ^oll (1 + o(l)) < ^ " < Ce^ < -a - p? 

mi ' m ) — {\2 _ \2 ^2 ^ ^ '-'\^)) — /i,2 _ \2 \2 — " — ^ ^' ' 

The last inequality comes from conditions SP, AD(?7i, k) and the definition of p, and holds when n 
is sufficiently large. 



A. 4 Proof of Proposition 16.11 

Here, we restrict the proof on the event such that the conclusions of Lemmas I6.1H6.3I hold. Thus, 
all the arguments are deterministic. Recall that T^^'^'° denotes the matrix obtained after the 
multiplication step of the A:-th iteration in oracle Algorithm 1. Denote ran (TC^).") by r^'')'", and let 
(f)^^^ be the largest canonical angle between T^^^^" and Vm So, sin^ 

In what follows, we first prove both claims for the first iteration, and then extend them to 
subsequent iterations. 



The First Iteration Claim (1) relies on the following lemma, which describes the effects of the 
multiplication and the thresholding steps separately. 

Lemma A. 5. In the first iteration of oracle Algorithm 1, 

(1) After the multiplication step, T^"*^^'" has full column rank and < ptan 

0(0). 

(2) After the thresholding step, has full column rank, and L{T^^^'° ,V^^'°) < sec^ 6^^\ 

Proof. Claim (1) is essentially one-step analysis of orthogonal iteration, which is obtained by di- 
rectly applying Theorem 8.2.2 in to a single iteration. 

The proof of claim (2) relies on Wedin's sin 9 theorem for singular subspaces (Theorem IB.2p . 
In the theorem, let A = T^^^'° and B = T^^^'". Then, the theorem gives that 

L(r«'°,p(i)'°) < " ^^(^(,),^ " • (A.21) 



In what follows, we study the numerator and the denominator of the right side separately. 

First, we derive a lower bound for am{T^^^'"). To this end, for any unit vector x G M™, 
let y = then y is a unit vector in W. Decompose y as y = y + ijc, with y S ran{Q°) and 

j/c G ran(Q°). Then, it follows that ||r(i)'°2;||^ = = \\S°y\\l + \\S°yc\\l > \\S°y\\l > (C)^lly|l2- 

Since \\y\\2 > cos^^'-'''', we obtain that 

a^(rW'°)> inf ||rW'°x||i > (C)2cos2 0(°). (A.22) 

||a:||2=l 

To bound the numerator, we define matrix AT € M^^'", whose (z^, j)-th entry is given by 



32 



Then, we obtain 

||^{i),o _ 2-(i),o|| < ||^(i),o _ 2-{i).o||p < ||AT||f = Cw. (A.23) 

Here, the second inequahty comes from the constraint on the thresholding function r]. 

Plug the bounds (|X22]l and (1X23]) into (TOT]) , we obtain the bound for L(r(^)'°, P^-*'") in the 
lemma. Finally, we are to show that T^^^'" has full column rank. By Corollary 8.6.2 of and 
(TOSD . it suffices to show that ||f(i)'° - T^^)'°\\ < f^uj < e^cos6^^\ i.e., u < cos6lW. To this 
end, we note that Lemma 16.31 shows cos 6'(°) > 4/5, while ^AM\\ shows that u} = o(l). Therefore, 
cj < cos 9^''°'^ for large enough n, which completes the proof. □ 

By the discussion following the loss function (12. ip . we have the triangle inequality 
sin^W < sin,/.W +Li/2(7-{i).o,pa),o) < ptan + a; sec 

Here, the second inequality comes from Lemma lA.5[ This completes the proof of claim (1). 
Turn to claim (2). We first show that, for any a < 1/2, 

implies sin^ 9^^^ / sin^ 9^'^^ < [1 — cl{1 — p)]^. By claim (1), it suffices to show 

^^^^ ^ sin 0(0) cos 0(0) ^ 1 - "(1 - P)- 

Let X = sin 0(0). Multiply both sides of the last display with xy/l — x"^ and collect terms. Then it 
is equivalent to < [1 — a(l — p)]V^ — x'^ — p}. The right side of last inequality is non-negative, 
because < (1 — /o)^/5 (by Lemma [6.3p and a < 1/2. So, we can square both sides of it, and some 
simple algebra shows that the inequality holds if 

Q{x^) = [1 - a(l - p)fx^ - (1 - af{l - pfx^ + < q. (A.25) 

Here, Q{x'^) is a quadratic form of x^, with discriminant A = (1 — a)^(l — p)^ — 4a;^[l — a(l — p)]^. 
Note that Lemmas 16 . 1H6 . 3l lead to 

-4 



UJ 



2 



(l-p)4 ^ ^' (A2,, + l)2 n V A2„ + l ; ^ ■ (A.26) 

<C(A-4vl)M„r2 



' nm ■ 



Here, C = C(7, r, m, k). Moreover, conditions GR and SP imply that the rightmost side of ()A.26P 
tends to as n — 7- oo. Therefore, for sufficiently large n > no (7, r, m, k), we have 

a;2 (l-a)4 



(l-p)4 - 16 - 4[l-o(l-p)]2' 

and so A > 0. Let the two roots of Q be x^. We solve Q to obtain that ()A.25p . and hence the 
conclusion, holds when 
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Note that, by ()A.26p . < (1 — p)^/5 < x\ always holds. Therefore, the case is completed because 
fOil) imphes fOTl) . 

Next, we show that if srn^ 6^^^ < 1.01a;2/[(l - a)(l - p)]^, then so is sin2 6'(^). By claim (1), 
it suffices to show that ptan6'(°) + usecO'-^^ < VTMuj/[{l - a)(l - p)]. Multiply both sides by 
cos ^ obtain that the last inequality holds when 



^^(l-aKl-p) / 1.01.^ 



because the right side is no greater than cos 9^^^ by the condition on sin 0(0). Since ^/x > X for 
X G [0, 1], it further suffices to have the last inequality hold but with the square root of the right 
side removed, which is equivalent to have 

<(l-p)[l-(l-a)/^/^01]. 



(l-a)2(l-p)2 

By (|A.26j) . the last desired inequality holds when n > 110(7, 'r,m, i^)- This completes the case. 

Subsequent Iterations Now we have Q^^^'° is orthonormal with sin^ 6^^^ < (1— /?)^/5. Therefore, 
we can essentially repeat the whole argument to prove both claims for k = 2, and so on by induction. 
This is valid because the minimum sample size needed, no = no(7,r, m, k), such that both claims 
hold does not depend on the iteration index k. 

A. 5 Proof of Proposition [6T2] 

we divide the proof into two parts, depending on whether the following condition holds or not: 

1.01^V(l-p)'>eL/4. (A.28) 

Without loss of generality, we assume that n = 2^ for some I > 1. So, / = log n/ log 2. 

First, consider the case where ()A.28P holds. Let ki be the number of iterations needed to achieve 

Before the last inequality is satisfied, the decay rate of the approximation error satisfies the bound 
in claim (2) of Proposition [6TT] with a = 1/2. Thus, it suffices to have ki satisfy [1 — (1 — p)/2]'^^^ < 
1.01wV[(l-i)(l-p)]2, i.e., 2/cil log(l-i(l-p))| > log[(l - - p)^/imuj^]. Condition (1X28]) 

implies (1 - 1/2)2(1 - p)^/lMu}^ < e'l < n/i(A^). In addition, | log(l - x)| > x, for aU x £ (0, 1). 
Thus, it suffices to set 

logn + V log/l(A^) Ai, + 1 ^ , w^9^^/-, 

ki = — z = ,2 ,2 N n + V log h{Xl)] (1 + 0(1)). 

Now, let k2 — ki be the number of additional iterations needed to achieve L(V^,Vm^^'°) < 
1.01a;2/[(l — 4)(1 — p)]2. Then, before the inequality is satisfied, the decay rate satisfies claim (2) 
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of Proposition O with a = 1/4. Thus, it suffices to have [1 - i(l - /9)]2('=2-'=i) < (l _ i )2/(l _ i )2. 
This is guaranteed if we set 



k2-ki> 



l-p 



log 1 - - - log 1 - - 



Recursively, we define /c, for i = 3, . . . , /, such that L{V:^,vi'^'°) < 1.01a;V[(l - 2"0(1 - p)?- 
Repeating the above argument shows it suffices to have 



l-p 



log ( 1 - i 1 - log ( 1 



1 



•>i-l 



, for i = 3, . . . , /. 



Therefore, if we set 
1 + 1/2 



ki - ki 



l-p A2,,-A2 iog2 



log 1 - - - log 1 



then L(V^,V^'^'°) < 1.01wV[(l - n-^)(l - p)]^ for ah k > k. We complete the proof for this case 
by noting that K >ki for large enough n and that 



(1 - Pf 



< Cmj^card{H)h{Xl,) < CMnT, 



2 

nm- 



Turn to the case where ()A.28P does not hold. Then, the decay rate in claim (2) of Proposition 
16.11 holds with a = 1/2 for all k such that L(T";^,Vm^'°) > e^^. Since the approximation error is 
monotone decreasing, it suffices to verify that [1 — (1 — p)/2]'^^ < e^^, i.e.. 



2K 



log 1 



l-p 



> log 



n 



log(Pn) 



log 



(A? + l)(A^^+i + l) 



Note that the second term on the right side is bounded above by logh{X^). Since log(p„) = o(n) 
and I log(l — x)| > x for x € (0, 1), it suffices to verify 

K>-^ [log n + log h{Xl)] = J"^ ^} log n (1 + o(l)), 

^ ~ P '^m~ ^m.+l 

which is indeed satisfied by K for large enoug h n. Thus, for all k>K, L{V^,V^^'°) < el^. This 
completes the proof. 



A. 6 Proof of Lemma 16.41 

First of all, we restrict our attention to the event on which all the conclusions of Lemmas I6.1H6.3I 
hold. Lemma I A. 3 1 shows that B° = B, which leads to Q^^^ = Q^^^'°. In what follows, we start with 
the equivalence between Q^^^ and and then extend the argument to subsequent iterations. 
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The First Iteration Fix G L and I < j < m. Let Q^'" = [q^^^'°, . . . ,q^rn''°]- Since = 
q{o),o^ the (i^, j)-th element of T^^^ is then = Siy.qj^^''^ = SyH(fjH° ■ Here, the second equality 
holds because 9^^'° = 0. Our goal is to show that for sufficiently large numeric constant 7, the 
size of t^^j is bounded by the threshold 7„j in Theorem 3.1 with high probability. Following the 
discussion after Lemma 16.31 it suffices to show that t^^^ is bounded by a constant multiple of 
^(A2 + 1) log(p„)/?i, since £,(5bb)/(A| + 1) ^ 1. 
By ()A.14p . we decompose as 

4 = [Q%{Q%)' + Pm)'] qfi° = 4'^ + 4'^- (A.29) 
Moreover, S^h admits the decomposition S^h = Yli=i SuH,i, where 

S,H,i = -Q.AV'VAQ'h., S,h,2 = -z[,z.H 

\ \ (A.30) 

SvH.-i = —Qu-AV'Z.H, S^HA = —Z'.^VKQ'jj.. 

n n 

In what follows, we bound t^l\ i = 1,2, respectively. The argument is lengthy but elementary. The 
key conclusions are ()A.3ip and (|A.36p . which then lead to (|A.43p . 
1°. Bound for t^^\ We show below 

|^)| < (l + o(l))||(Q?,.)'g;.'2'°||2(/3V^ + 2^/3)^. (A.31) 

To this end, using (ICTjl . we decompose = Eti S^,H,iQ%.{Q%.)'t-ii'° = Eti4!^- 
For we further decompose it as 

tS^ = O..A'0'//.Q?f.(Q?f.)'?^2'° + {^v'v - kQ'H.Q%.{Q%.)'t-i'°- 

Recall (IA3D. We have \\QuA^Q'jj,Q%\\ < ||Qv.A2||2||Q'^.Q°^.|| + \\Qi,u-Alh\\Q[^H-Q°H-\\- Let 

p,,- = (Af + 1)/(A,2 + 1), l</,j<m. (A.32) 

Since G L, we have \q,j\ < /3r„„ and so WQ^.A^h < {YZi^^^Ii^hY'^ = {ZT=i PljY'\Nl)lnj, 
with 7 and 7„j specified as in Theorem 3.1. Similarly, ||Qi^,y.Af II2 < i^^m+i Pij^'^'^il^ h)^rij- 
Moreover, we have WQ'h.Q^hW < 1 and \\Q'i^h-Qh\\ < ll<5iQ°ll < L^''^{'Pm,V^) = o(l). Therefore, 

/ rn \l/2o 

\\Q.A^Q'h.Q%.\\ < (1 + 0(1)) Y.P^^^ 



On the other hand, on the event ()A.9p . we have 

IIO..A {-v'v - 1] aq'hW < WQuMhW-v'v - imQ'n. 

\n J n 



1/2 a 

Plnj I . 

= o{lnj)- 
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Here, the last equality comes from conditions GR and SP. Finally, the triangle inequality leads to 



|tff|<(l + o(l))( ^Plj) \\{Q%.yq^'^' 



1=1 



7 



For t^2^ define event 



\\Z:,Z.hQ°h.\\2 < n^n, l,m; ^/6), Vi/ G l} . 
Note that ^{n,l,m;\/G) = (1 + o(l))[6/(A| + l)]^/2(/3/7)7„j . On this event, we have 



|t«| < (l + o(l))l \\{Q%.yq)Th 



1/2 



7 



For ty^, on the event (jA.lSp . we have 



< ^\\Q.M2\\v'z.HQu\\m.yQiTh 

1/2 



1 



)0 \/c(0),O| 



n 



Switch to ^^4^ We note that \\AQ'h.Qh-\\ ^ + ^(l))- So, on the event 

{\\Z:,V\\2 < <(n, l,m;^/6),Vz. G l} , 

we have 

.2 xi/2 



< (1 + o(i)) 2^i^^ (j: I) ii(Q?,.)'^.r ii2 ^ = o(7. 



i«i< (1+0(1)) (^1^) ii(Q?..)'^rib^- 



Finally, we apply the triangle inequality to obtain 



4 r / m V 1/2 

<E\'u\ < (1 +«(!)) wiQH.riTh fi[Ep^^) +'^^pf 



4=1 



^. (A.35) 
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(A.33) 



(A.34) 



Without loss of generality, we could assume that Xi/Xm — )■ 1 as n — )■ oo. When this is the case, 
(|A.35p reduces to (|A.3ip . If this is not the case, let 1 < niQ < m be the smallest index such that 
Xmo/Xm — ^ 1- We further divide the set {!,..., m} into {!,..., mo — 1} U {mo, . . . , m}, and similar 
but lengthier argument would also lead to (lA.3ip . 



2°. Bound for t'^\ Here, we show 



r , m X 1/2 

i4'^i<(i + o(i))ii(p^)'gS3'°ii2 /3 pA +^'^1, 



.1/2 



7 



(A.36) 



To this end, observe that (IXSOjl leads to = Ef=i SvH,iP%{P%)'qfu° = Ta=i A'i ■ 



yo v;;^0),o _ ,-^4 ,(1) 
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For t2i\ by argument similar to that for , we have on event (|A.9p . 



(1) 



l4?|<(l + o(l))( E Pi^) WiPnyfi^h 



l=m+l 



1/2 



30 V:;^^'^)'°l 



7 



For t^22 ^ define 



/I) 



nt^^j/\\Z.HPMP?iyQjTh- 



(A.37) 



By our construction of the initial value, is a function of Shh and hence is independent of Z.^. 



In addition, Z.jj is independent of Z.^. Thus, we have yl}j ~ N{0, 1). Define events 

{lyj^l < V71og(p„), 1 < j < m,Vi/ G l} , 
{\\Z.h\\ < + Vcard{H) + 2Vlog(p„)} . 

On the intersection of the above events, we have 



(A.38) 
(A.39) 



i^Si < -ly^yi \\z.H\\\myq,Th < {i+o{Xim{p°Hyg,Th{j2^] 



30 Vo(0),o 



,0 N/^0),O| 



7 y^^-fnj 



Switch to t^g^ On the event (fATTIl . we have 



7 



Condition (GR) implies 

fh 



1 ca rd(ij') ^ /3 



/3- 



3k2 



n 



nV2-r-/4[log(p^)]l/2+r/4 ^ 



log(Pn) 


l/2-r/4 ^2 








(a? + i) 



r/2 



< 



/5- 



■•/4 E ' 



A|,nV2-r/4[log(p„)]l/2+r/4^^ 



nA^4 



l/2-r/4 



oil). 



Thus, we obtain 1^3^ | = o(7i 



For t^y, we could bound it using the same strategy as for t^22 ■ particular, let 



w 



(1) 



nt^^J/\\VAQH.PMP"HyQfi'l2 

-(0),o 



(A.40) 



For any u ^ L, Z.y is independent of both V and q~-^ , and so 

^w^l]\ < v/71og(p,,), 1 < J < m, Vz. G l} , 

+ 2Vlog(pO}- 



A^(0, 1). Define events 



(A.41) 
(A.42) 
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Then on the intersection of them, we have 

14^1 < l^U'^\\\v\m\myg^Th < a+om^-^y^myqf^-h^. 

Finally, the triangle inequality leads to ()A.36p . 
3° .Summary. By Lemmas I6.1H6.3| we have 

WiQUfi^ = 1 + 0(1), = o{i). 

This, together with (jA.29p . (|A.3ip and (|A.36p . implies that, if we choose (3 = cj ^fm^ then for large 
enoug li 7 > 70(c) (e.g., 7 > 1.01 • (2^/3 + c)), when n is sufficiently large, 

1*2^1 < (l + o(l))(/3V^ + 2V3)^ <7ni. (A.43) 
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Since the above inequality holds for all v ^ L and 1 < j < m, we obtain T^^ = 0, and so, 
^ f(i),o_ Therefore, Q^^) = Last but not least, Propositions EH |R2] and Lemmas [RT] 

- IB.3l ensure that the intersection of all the above defined events has probability at least 1 — Cop~'^. 

Subsequent Iterations Fix the choice of 7„j such that ()A.43p holds. Here, we extend the above 
calculation up to the 2i^-th iteration. The goal is to show that with high probability, 

Q(fc)^Q(fe),o^ A; = 1,...,2K. (A.44) 

To this end, we first recall the events defined in (]A.38p and ()A.4ip . Note that for any A; > 1, 
ul^j and w^j^j can be defined analogously by replacing with qf^'°. For any I < j < m and 

k > 0, is a function of V and Z.h, which is independent of Z.l, and so both and w^^-^ 

follows the A^(0, 1) distribution. Thus, we define events 

{\yij^\ < \/71og(p„), 1 < J < m, Vz. G L, 1 < A; < 2A'} , (A.45) 

[\wl^j\ < ^71og(p„,), 1 < J < m,Vi/ £L,l<k< 2a} . (A.46) 

Under our assumption, for any given i/, j, and k, the tail bound of standard normal distribution 
gives 

^ ^ Vlog(p„) 



Since m is fixed, card(L) < pn, and K = 0{y^n log(p„) ), we obtain that 



I (fc) 



< V71og(pn), l<j<m,yueL,l<k< 2K^ 

m 2K 

^ 1 - EEE^Ii^i?! > ^7bi(^} > 1 - cop;^ 

j=l u<^L k=l 

The same argument extends to the event in (IA.46p . Finally, on the intersection of (|A.45p . (|A.46p . 
and the other events involved in the calculation of the first iteration, we can essentially repeat the 
arguments for the first iteration 2K — 1 times to obtain ()A.44p . This complete the proof of Lemma 
[631 
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B Auxiliary Results 



We collect here some auxiliary results used in the proofs. 

First, we present two sinO theorems. One is for principal subspaces, which comes from and 
is used extensively in the proofs of Lemmas I6.1H6.31 The other is for singular subspaces, which 
comes from [35|] and is used in the proof of Proposition 16. li 



Theorem B.l (sin0 theorem for principal subspaces). Let A and A + E be symmetric matrices 
satisfying 



A = [Go Gi] 



"^0 0' 






Ai 







and A + E= [Fq Fi] 



'Ao 0" 






Ai 







withGQGi = 0, Gi orthonormal, and [.Fo-^i] orthogonal. If the eigenvalues ofAoG'^Go are contained 
in an interval (a, b), and the eigenvalues of Ai are excluded from the interval {a — 6,b + 6) for some 
S > 0, then 



L(ran(Go),ran(Fo)) = ||F{Go 



< 



Theorem B.2 (sinO theorem for singular subspaces). Suppose that p > k. Let p x k matrices A 
and B = A + E both have full column rank, and let Wfc+i, . . . ,Wp be orthonormal vectors spanning 
T8iii{B)-^ , the orthogonal complement ofran{B). Take W = [wk+i, ■ ■ ■ ,Wp], and define R = A'W. 
IfaniiniA) > 6 > 0, then 

L(ran(A),ran(5)) < \\Rf/5'^ < WEf/S"^. 

The last inequality holds as R = A'W = —E'W. 

Next, we present two probabilistic bounds on matrix norms. 

Proposition B.l ( 22] and 0]). Let Y be an n x p matrix with i.i.d. N{0, 1) entries. For t^ = 
6 -y/log n/n and any fixed c > 0, there exist no{c) > 0, such that for any n > no(c), 

< 2n-^'. 

be two independent matrices with i.i.d. N{0, 1) 



\-Y'Y-I\\ > 2j^ + ^ + ctr 
n y n n 



Proposition B.2. Let Y G M"^' and Z G M"^™ 
entries. Then for any < a < \\fri and b > 0, 



\Y'Z\\ > n + ^ ( ^/i + + ^ ) ^ < (/ A m)e-3«Vi6 + ,-^72. 



n 



Proof. Without loss of generality, suppose / < m. Define Y = [yi, . . . ,yi], where yi = yi / \\yi\\2 with 
yi the i-th column of Y. Then, we obtain < (maxi<j</ ||yj||2) Note that \\yi\\2 are 

i.i.d. Xn I'andom variables. We apply Lemma IB. 21 to obtain that 

I 

nwvih > v^^, i = 1, . . . ,1} <Y,n\M2 > ^/^^i < /e-^^'/ie. 

i=l 
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On the other hand, Y' Z is an Z x m matrix with i.i.d. A^(0, 1) entries. So, Lemma IB. 31 leads to 
Y'Z\ > Vl + \/^ + b} < e~'''/2. Therefore, we obtain 



\Y'Z\\ >n^|l + lUy^+J^+j^)\<n\\y.h> V^a, ^=l,...,l} 



+ p{||y'z|| >Vl + ^M + b} 

This completes the proof. □ 

Finally, we list three probabilistic bounds which have been used multiple times. 
Lemma B.l. Let x ~ iV(0, 1). For any t > 0, ¥{\x\ > t) < y^r^e^^'/^^ 

Lemma B.2 ([3]). Let Xn denote a Chi- square random variable with n degrees of freedom. Then 

P{Xn > n{l - e)} < e"'"''/'', when < e < 1, 

¥{xl > n(l + e)} < e-3"^'/l^ when < e < i, 

F{xl > n(l + e)} < ^e^"''/^, when < e < n^/^^,n > 16. 

Lemma B.3 Let Y be an n x p matrix with i.i.d. A^(0, 1) entries. Then, for any t > 0, 

F{\\Y\\ >^+^ + t}< e-*'/2. 
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